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This  study  was  prompted  by  the  need  for  expert  system  software  to  produce  required 
results  in  real-time  for  systems  like  the  Robotic  Air  Vehicle.  The  expert  system  processing 
speedups  realized  on  serial  machines  due  to  state-saving  match  algorithms  like  Rote  are 
impressive,  but  they  still  fait  short  of  real-time  processing.  This  research  investigation 
focuses  on  parallel  processing  of  such  state-saving  algorithms  with  the  goal  of  achieving 
real-time  expert  system  processing. 
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.-1  b.-tr<id 

This  research  investigation  proposes  a  hypercube  design  (hardware  and  software) 
which  supports  efficient  symbolic  computing  to  permit  real-time  control  of  an  air  vehi¬ 
cle  bv  an  expert  system.  Real-time  processing  requirements  motivate  the  research*  •;  to 
alleviate  common  expert  system  bottlenecks.  Kxamples  of  these  bottlenecks  include  the 
inefficiency  of  symbolic  programming  languages  like  I.isp  and  the  disproportionate  amount 
of  computation  t i mo  commonly  spent  in  the  match  phase  of  the  expert  system  match 
select -act  cycle. 

In  t lie  design  presented  in  this  research  invest igat ion,  fast er  processing  of  t  he  Defense 
Advanced  Research  Projects  Agency's  (DARPA)  Robotic  Air  Which'  (RAY)  expert  system 
software  is  approached  through  1)  fast  production  matching  using  an  expert  system  shell 
which  employs  Roto,  a  state-saving  match  algorithm,  2)  more  efficient  expert  system  shell 
program  execution  due  to  implementation  using  the  ('-Programming  Language  and  H) 
parallel  processing  of  the  RAV  expert  system  production  cycle  using  multiple  copies  of  the 
serial  expert  system  shell.  Lor  this  design,  the  serial  (’-Language  Integrated  Production 
System  (CLIPS)  shell,  which  uses  the1  Rete  match  algorithm,  is  niodil  ed  to  execute  in 
parallel  on  the  iPSC/2  Hypercube.  Although  the  RAV  expert  system  is  the  application  of 
interest  in  this  investigation,  the  parallel  expert  system  shell  is  capable  of  processing  any 
(’Ll PS- syntax  software. 

Speedups  achieved  using  tins  architecture  are  quantified  through  theoretical  timing 
analysis,  and  comparison  with  serial  architecture  performance  results,  with  earlier  parallel 
architectures'  performance  results,  with  best  case  tl  .-oretical  analysis  performance  results, 
and  with  the  "real-time"  goal  performance.  This  performance  quantification  approach 
introduces  the  concept  of  a  performance  spectrum  which  exposes  the  level  of  maturity  of 
RAV  expert  system  processing  in  particular  and  the  level  of  maturity  of  parallel  expert 
system  shell  processing  on  a  multicomputer  in  general. 
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I.  Introduction 

The  feasibility  of  improving  the  performance  of  production  system  software  (par¬ 
ticular!).  expert  system  software)  running  on  a  parallel  architecture  is  an  area  of  current 
interest  in  artificial  intelligence  research.  One  sponsor  of  research  in  this  area,  the  Aii  Force 
Wright  Aeronautical  Laboratories  (AFWAL),  reauires  a  fast  multiprocessor  architecture 
to  process  an  expert  system  capable  of  pilotinga  robotic  air  vehicle  (RAV)  (2X:132(>).  The 
Strategic  Defense  Initiative  Organization  (SDK))  is  another  sponsor  of  faster  expert  sys¬ 
tem  processing  research.  I  his  research  i  ivestigation  is  performed  in  support  of  the  RAV 
expert  system  project  ami  is  a  follow-on  to  the  investigation  by  Captain  Donald  Shaklev 
i  -t7  i.  Some  of  the  area.-  recommended  fo-  future  research  in  Shaklev 's  thesis  are  pursued. 

/  /  System  Hi  qum  mi  nl 

The  concept  of  the  RAV  project  is  to  create  an  unmanned  air  vehicle  capable  of 
autonomous  operation  (see  Appendix  A).  To  develop  this  robotic  capability,  AFWAL 
contracted  Texas  Instruments,  Inc.  to  produce  RAV  software  which,  when  executed,  could 
perform  all  of  the  basic  piloting  skills  as  well  as  various  navigation  and  obstacle  avoidance 
functions  (2x:K{2f*).  AFWAL  and  IT  jointly  chose  an  expert  system  as  the  preferred  RAV 
soft  ware  implement  at  ion  method  for  several  reasons. 

First  and  foremost.  AFWAL  uncovered  through  literature  reviews  that  efforts  to  ap¬ 
ply  traditional  software  implementation  methods  to  intelligent  vehicle  control  had  failed 
due  in  part  to  unmaiiageabdiM  of  code  (37:77).  Second,  researchers  increasingly  are  dis¬ 
covering  that  applications  that  humans  currently  do  better  than  machines  (like  the  task  of 
piloting  an  aircraft )  lend  themselves  to  solution  using  an  artificial  intelligence  approach. 
Because  expert  system  t  ethnology  is  currently  one  of  ihe  most  successful  branches  of  arti¬ 
ficial  intelligence.  Al'WAL  and  I  I  chose  the  expert  svstem  approach  to  the  RAV  software 
const  ruction  /  !  x;  1  i . 
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Unfortunately,  the  RAY  expert  ‘-vstem  has  proven  to  bo  too  compute-intensive  to 
yield  results  in  pud  time  on  any  serial  or  parallel  computer  architecture  (hardware  and 
software)  developed  to  date.  Real-time  means  ‘‘the  time  needed  to  make  a  calculation  has 
to  be  less  than  the  time  from  when  the  need  for  the  calculation  is  recognized  until  the  time 
when  the  response  is  needed  to  take  action"  (31:10).  The  RAV  concept  is  not  feasible  until 
a  computer  architecture  that  produces  real-time  results  processing  the  RAV  expert  system 
ran  he  developed. 


/...’  lit  luh  d  Work 

The  only  parallel  architectures  applied  to  the  RAY  expert  system  were  developed 
by  Shakley  i.37).  1  lu'se  include  parallel  processing  designs  implemented  on  a  network  of 
'lex as  Instrument-  (  III  Kxplorer  l.isp  machines  and  on  a  first  generation  Intel  Personal 
Supercomputer  (iPSCi.  Although  this  study  showed  that  the  increased  parallelism  of  the 
iPSf '  design  could  it:  lact  ; > •  i n  1  u oc'  processing  speedup  compared  to  the  T1  Kxplorer  design, 
the  amount  of  speedup  leah/ed  was  hampered  by  the  effects  of  interproccssor  comm  uni 
ration  overhead  and  load  imbalance.  These  effects  were  caused  mainly  by  the  simple  data 
decomposition  employed  on  the  iPSC  design  (37:72).  Furthermore,  comparing  the  iPSC’ 
design  to  the  i  1  Kxplorer  design  may  have  allowed  factors  unique  to  each  of  these  products 
to  skew  the  analysis.  r hat  is.  the  difference  in  the  observed  processing  speeds  of  the  two 
systems  may  have  resulted  from  factors  other  than  just  the  difference  in  the  number  of 
parallel  processors  applied. 


! .  (  I’voblr  in  nit  ut 

Tim  goal  of  this  r<‘searcli  investigation  is  the  design,  implementation,  and  analysis 
of  a  parallel  processing  art  iiitert ure  implemented  on  a  second  generation  Intel  Personal 
Supercomputer  (il’S< '  j'l  \  and  the  quantification  of  the  processing  speedup  realized  by  tnis 
design  as  applied  to  the  RAY  expert  system. 
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The  following  ar e  tie  objectives  of  this  research: 

•  implement  an  efficient  and  effective  parallel  RAY  expert  system  architecture  on 
the  iPS(’/2  configured  to  use  varying  numbers  of  processors  to  gather  data  on  the 
speed  ups  realizable  through  parallelization 

•  quantify  the  performance  results  for  this  parallel  RAY  design  through  theoretical 
order-of  timing  analysis  and  through  comparisons  with  past  research  results,  with 
serial  architecture  results,  with  best-case  theoretical  analysis  results,  and  with  the 
real  time  goal  performance 


/ .  ■  ')  SrofH 

The  Intel  Persona)  Sujiercomputer  (iPSC/2)  is  the  multicomputer  employed  in  this 
research  investigation  (see  Appendix  B).  This  supercomputer,  which  can  be  configured 
with  up  to  12 x  available  processors,  is  chosen  due  to  availability  as  well  as  to  research 
interest  in  the  performance  of  artificial  intelligence  application  software  executing  on  a 
multicomputer.  Limiting  the  analysis  to  designs  implemented  on  this  one  supercomputer 
eliminates  the  possibility  of  machine-unique  factors  skewing  the  results,  thus  yielding  more 
quantifiable  comparison'-  of  the  performance  differences  between  designs. 

Software  coding  is  done  in  the  C  Programming  Language.  C  was  chosen  over  the 
major  AI  programming  languages.  Lisp  and  PROLOG,  to  facilitate  better  efficiency  and 
portability  (27:190).  C  is  also  the  only  one  of  these  languages  available  for  use  on  the 
AI  IT  iPSC/2  at  the  time  of  this  research  investigation.  Shakley’s  parallel  RAY  design, 
however,  was  implemented  not  in  C  but  in  Lisp  on  the  iPSC.  Shakley's  design  is  not 
reirnplemented  in  C  as  part  of  this  research  investigation,  because  the  potential  execution 
speed  ups  attainable  t  hr<  uigli  reimplementation  in  C  on  the  iPSC/2  are  derived  much  more 
easily  using  theoretical  analysis. 

Because  the  RAY  expert  system  is  the  application  of  interest,  it  is  the  only  application 
to  which  research  de-dens  are  annlied  for  performance  comparison  purposes.  Pertinent 
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RAY  computation  data  is  acquired  from  and  assumed  validated  by  AFYV'AT  RAY  project 
managers. 

1.6  Constraint > 

The  decision  to  dm1  the  il’SC/2  to  implement  an  RAY  parallel  expert  system  drives 
many  program  design  divisions.  Certain  features  cf  iPSC/2  hardware  configuration,  such 
a>  a  local  memory  for  each  processor  nodi',  exclude  many  program  design  options.  For 
example,  algorithms  that  perform  relatively  few  operations  between  synchronizations  (i.e., 
have  a  small  grain  size;  usually  exhibit  poor  efficiency  on  multicomputer.?  that  employ 
multiple  local  memories  a>  does  the  il\S(’/2  (34:'»3).  Consequently,  small  grain  algorithms 
are  excluded  from  consideration  in  this  study.  In  fact,  the  iPSC/2's  multicomputer  design 
prevents  this  research  investigation  from  taking  advantage  of  extensive  production  system 
research  performed  to  date  using  shared-memory  multiprocessors  (14:63). 

The  RAY'  expert  system  is  the  application  of  interest  to  this  research  investigation. 
Although  the  research  methodology  and  design  factors  are  applicable  to  any  parallel  archi¬ 
tecture  research,  the  performance  results  realized  are  unique  to  the  RAV  expert  system 
That  is  to  say.  even  though  the  form  of  the  RAY'  expert  system  is  not  unlike  that  of  any 
other  expert  system  program,  the  application  of  the  architecture  implemented  in  this  inves¬ 
tigation  to  other  expert  systems  will  not  necessarily  produce  the  same  performance  effects. 
This  fact  is  significant  when  analyzing  the  expert  system  shell  execution  speeds  observed 
for  an  application  and  when  using  these  execution  speed  analysis  metrics  to  quantify  the 
merit  of  the  expert  system  shell. 

1.7  Summary 

This  research  investigation  addresses  improving  the  performance  of  production  sys¬ 
tem  software  by  executing  this  software  on  a  parallel  architecture.  Such  an  architecture  is 
applied  to  the  Robotic  Air  Y’ehicle  (RAY')  expert  system.  The  parallel  expert  system  archi¬ 
tecture  is  implemented  on  the  iI’SC/2  supercomputer  using  the  C  programming  language. 
Realization  and  quantification  of  performance  speedups  using  this  paralli  1  architecture  are 
'tm  objectives  of  this  investigation. 
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Chapter  1!  provides  background  information  on  the  key  topics  underlying  this  re¬ 
search  effort,  including  expert  systems  and  parallel  processing.  Chapter  III  describes  the 
Rote  match  algorithm  and  issues  regarding  parallelization  of  Rete.  Chapter  IV  details 
the  fundamental  approach  to  this  research  and  introduces  performance  quantification  us¬ 
ing  a  performance  spectrum.  In  Chapters  V  and  VI,  the  lower  and  upper  performance 
bounds  of  processing  the  RAV  software  are  derived.  Capt  Shakley's  parallel  RAV  design 
and  its  performance  aie  related  in  Chapter  VII.  The  parallel  Rete  expert  system  shell 
design  and  its  application  to  the  RAV  expert  system  arc  presented  in  Chapter  VIII.  The 
many  performance  metrics  produced  in  Chapters  V  through  VIII  are  compared  using  the 
performance  spectrum  method  in  Chapter  IX.  Finally,  the  research  conclusions  and  some 
recommendations  stemming  from  research  findings  are  offered  in  Chapter  X.  Appendices 
are  available  for  those  seeking  deeper  insight  into  the  RAV  project  itself  (Appendix  A), 
parallel  architectures  (Appendix  B).  the  theoretical  and  actual  realizations  of  the  parallel 
Rete  expert  system  shell  (Appendix  C  and  Appendix  D,  respectively),  and  programmers’ 
and  users'  manuals  for  the  expert  system  shell  (Appendix  E  and  Appendix  F,  respectively). 
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II.  Background 


This  chapter  summarizes  some  of  the  underlying  concepts,  uncovered  through  liter¬ 
ature  search,  that  form  the  basis  of  this  research,  including  production  systems,  expert 
systems,  parallel  processing,  communication  overhead,  and  load  imbalance. 

2.1  Production  System 

A  production  system  is  a  pattern  recognition  formalism  based  on  string  replacement 
rules.  An  order  is  imposed  on  these  rules  to  decide  which  applicable  rule  to  apply  next. 
Production  system  computation  proceeds  as  a  string-resolution-based  search.  A  control 
strategy  is  used  with  string-modifying  production  rules  to  model  certain  types  of  human 
problem-solving  behavior  (30:48). 

The  major  elements  of  a  production  system  are  a  global  database  of  facts,  a  set  of 
production  rules,  and  a  control  system. 

A  fact  is  an  assertion  which  represents  a  specific  item  of  knowledge.  A  fact  is  generally 
of  the  following  abstract  form  (14:8): 

(  <  object  ><  attribute 1  ><  value  1  ><  attribute 2  ><  value 2  >  . . .) 

Written  in  this  form,  a  fact  is  a  parenthesized  list  consisting  of  a  constant  symbol, 
commonly  called  the  object,  and  zero  or  more  attribute-value  pairs.  An  object  represents 
an  entity  within  a  problem's  domain  that  is  of  significance  to  the  solution  of  that  problem. 
Ari  attribute  represents  a  specific  characteristic  of  its  associated  object.  A  value  is  the 
parameter  instantiated  for  a  given  attribute  (36:75). 

The  global  database  of  facts,  also  called  working  memory  (WM),  is  the  central  data 
structure  used  by  a  production  system.  Because  facts  are  held  in  working  memory,  the 
facts  are  often  referred  to  as  working  memory  elements  (WME).  At  the  start  of  production 
system  computation.  WM  contains  a  set  of  initial  facts  about  the  problem  domain.  As 
production  system  computation  proceeds,  facts  are  added  and/or  deleted  from  WM. 
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A  proauction  rult  is  expressed  as  strings  that  represent  general  knowledge  about  a 
particular  subject  area.  A  production  rule  eenerally  apDoars  in  the  form  of  an  “IF  condition 
THEN  action”  implication.  By  convention,  the  condition  part  of  a  rule  is  called  the  Left 
Hand  Side  (LHS)  and  the  action  part  of  a  rule  is  called  the  Right  Hand  Side  (RHS).  The 
LHS  is  consists  of  one  or  more  condition  dements  (CEs)  that  are  compared  to  the  actual 
state  of  referenced  facts  in  \VM  at  a  given  point  in  computation.  The  RHS  consists  of  an 
unconditional  sequence  of  actions  which  can  add  facts  to  and/or  delete  facts  from  \VM. 
The  following  is  a  sample  production  rule  (14:12): 

(rule  sample  : 

if  ((object'!  attribute 1  15  attribute  2  y) 

( object  A  attribute 1  y) 

) 

then 

(add  objects  attmbutel  12)) 

This  production  rule,  named  "sample",  consists  of  two  condition  elements  in  its  LHS 
and  one  action,  an  add  to  \VM.  in  its  RHS.  Note  that  this  production  rule  form  can  be 
generalized  in  the  following  first-order  predicate  calculus  form: 
if  ((I\(xx.r2  i)  + 

(P2(x:f,x,)) 

) 

then 

(PAx  1,13)) 

In  this  example,  both  predicate  F\  and  predicate  P2  must  be  true  for  predicate  P3 
to  be  true. 

The  production  rules,  known  as  the  production  memory  (PM),  are  matched  against 
the  global  database  of  facts.  All  rules  in  PM  can  match  against  and  alter  any  facts  in  WM. 
The  LHS  of  each  rule  in  PM  is  either  satisfied  or  is  not  satisfied  by  one  or  more  facts  in 
the  global  database  in  WM  at  any  given  discrete  step  in  production  system  computation. 
If  a  rule’s  LHS  is  satisfied,  that  rule  may  be  applied.  Application,  or  firing,  of  a  rule  adds 
facts  to  and/or  deletes  facts  from  WM  as  dictated  by  that  rule’s  RHS. 


For  example,  the  production  rule  “sample"  shown  above  is  satisfied  when  there  exists 
one  or  more  facts  in  \YM  that  meet  all  of  the  following  conditions:  i)  the  fart's  object 
matches  object 2,  ti)  the  fact's  current  value  y  for  attribute  1  is  “15”  and  mj  the  fact's 
current  value  y  for  attrilmti  J  is  the  same  as  the  current  value  y  saved  for  attributcl  in  some 
other  fact  whose  object  matches  objectf  If  all  three  of  the  above  conditions  exist  in  WM, 
the  "sample"  rule  is  then  eligible  to  fire,  adding  a  new  fact  with  object  objects  and  value 
"12"  for  attribute  1  to  WM. 

It  is  possible  during  production  system  computation  that  more  than  one  rule's  LHS 
is  satisfied  by  the  state  of  the  facts  in  WM.  The  list  of  satisfied  production  rules  in  PM 
i*  commonly  called  the  conflict  set.  The  control  system  chooses  which  among  all  of  the 
satisfied  rules  is  to  be  applied.  The  choice  of  which  rule  to  fire  may  be  based  on  some 
firing  priority  assigned  to  each  rule,  on  some  characteristic  of  the  rule  string  itself,  or  on 
some  arbitrary  ordering.  The  control  system  may  employ  an  irrevocable  control  strategy 
or  a  tentative  control  strategy.  In  an  irrevocable  strategy,  a  satisfied  rule  is  selected  and 
applied  without  provision  for  reconsideration  later.  In  a  tentative  strategy,  a  satisfied 
rule  is  selected  and  applied,  but  with  provision  made  to  return  later  to  that  point  in 
the  computation  to  apply  some  other  satisfied  rule  instead.  The  control  system  halts 
computation  when  a  predefined  termination  condition,  or  goal  condition,  exists  in  the 
current  contents  of  WM  (30:1*). 

The  basic  production  system  algorithm  is  illustrated  in  Figure  2.1.  This  algorithm 
is  executed  as  a  Match-Select- Act  cycle,  usually  in  the  following  order  (11:36): 

1.  Match  -  evaluate  the  LlISs  of  the  production  rules  to  determine  which  are  satisfied 
given  the  current  contents  of  WM 

2.  Select  -  choose1 2 3 4  one  production  rule  with  a  satisfied  LHS  from  the  conflict  set;  if  no 
production  rules  have  satisfied  LHSs,  return  control  to  the  user 

3.  Act  -  perform  the  actions  specified  in  the  RHS  of  the  selected  production  rule 

4.  If  a  termination  condition  is  detected,  then  return  control  to  the  user;  otherwise  go 
to  Step  1  (  Match  ) 
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Procedure  PRODUCTION 

1.  DATA  —  initial  fact  database 

2.  until  DATA  satisties  the  termination  condition,  do; 

bejii  n 

select  some  rule  R  in  the  set  of  rules 

that  can  be  applied  to  DATA 

DATA  —  result  of  applying  R  to  DATA 

3.  end 


Figure  2.1.  Production  System  Algorithm  (30:21) 


2.2  Expert  System 

One  class  of  production  systems  is  that  of  expert  systems.  Expert  systems  are  formal 
computing  systems,  or  programs,  that  use  the  production  system  paradigm  to  offer  advice 
or  solve  problems  by  reasoning  with  bodies  of  knowledge  highly  specific  to  a  particular 
domain  (4:105).  The  bodies  of  knowledge  are  generally  extracted  from  human  experts 
in  the  domain.  Using  this  knowledge  base,  the  expert  system  attempts  to  emulate  the 
experts'  methodology  and  performance  toward  solving  a  problem  (27:291). 

Knowledge  engineering  is  the  interdisciplinary  AI  field  concerned  with  the  extraction 
of  knowledge  from  domain  experts  and  the  transfer  of  this  knowledge  into  hardware  and 
software  representation.  After  the  knowledge  engineer  has  developed  the  basic  expert 
system,  the  acquired  expertise  is  refined  through  a  process  of  giving  the  system  example 
problems  to  solve.  Domain  experts  criticize  the  system’s  behavior  and  make  any  required 
changes  or  modifications  to  its  knowledge.  This  process  is  repeated  until  the  system  has 
achieved  the  desired  level  of  performance  (27. 1G).  Because  the  knowledge  engineering  task 
is  difficult  and  expensive,  expert  systems  typically  emulate  problem  solving  over  a  very 
limited  domain. 

The  principal  components  of  a  rule-based  expert  system  are  a  knowledge  base,  an  in- 
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ft  rence  engine ,  and  a  man-machine  interface  (see  Figure  2.2).  The  knowledge  base  contains 
the  global  database  of  facts  and  the  production  rules  that  embody  an  expert’s  expertise. 
The  inference  engine  is  tin*  control  system  that  serves  as  a  reasoning  mechanism  and  search 
controller.  It  is  the  inference  engine  that  performs  the  match-select-act  cycle.  These  two 
expert  system  components  represent  the  major  elements  present  in  any  production  sys¬ 
tem.  The  man-machine  interface  simply  produces  dialog  (string,  graphics,  etc.)  between 
the  computer  and  the  user  (18:8-13). 


Figure  2.2.  Expert  System  Components  (18) 


2. .y  Sjxtdup  and  Piu-aliil  Pi-recessing 

Compute-intensive  applications,  such  as  the  RAV  expert  system  (28),  require  pro¬ 
cessing  beyond  the  performance  ability  of  conventional,  single-processor  machines.  Ishida 
and  Stolfo  suggest  that  although  speed  improvements  in  single-processor  machines  have 
occurred,  “further  speed  improvements  are  required  for  very  large  [expert]  systems  with 
severe  time  constraints.”  (23:568)  Parallel  processing ,  which  means  applying  several  pro¬ 
cessors  to  run  the  solution  algorithm  for  a  single  problem,  is  one  approach  to  achieving 
such  speed  improvements. 

The  degree  to  which  parallel  processing  improves  processing  speed  depends  upon  the 
efficient  use  of  available  processors.  Two  main  obstacles  to  achieving  peak  performance 
using  parallel  processing  are  communication  overhead  and  load  imbalance  (37:72).  One  key 
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to  overcoming  botli  obstacles  is  the  proper  choice  of  a  problem  decomposition  approach. 
A  decomposition  algorithm  divides  the  overall  expert  system  into  independent  subunits, 
or  tasks,  each  of  which  is  assigned  to  one  of  the  available  processors  for  execution. 

2. 4  Communication  Overhead 

Communication,  or  the  passing  of  required  information  between  processors,  seriously 
degrades  expert  system  performance  because  the  overhead  can  become  so  intense  that  “the 
[processors]  spend  more  of  their  time  communicating  than  computing. ”  (6:72)  Obviously, 
chosing  the  most  current  technology  parallel  processor  that  performs  interprocessor  com¬ 
munication  as  fast  as  possible  is  one  way  of  reducing  communication  time.  An  algorithmic 
approach  to  reducing  the  detrimental  effects  of  communication  overhead  is  to  decompose 
and  distribute  the  expert  system  tasks  in  such  a  way  that  the  interprocessor  communica¬ 
tion  time  incurred  by  parallelizing  the  expert  system  is  less  than  the  computation  time 
saved  through  parallelization  of  the  expert  system.  The  benefits  of  a  “good”  expert  system 
decomposition  are  realized  regardless  of  the  particular  parallel  processor’s  communication 
speed.  But  an  extremely  fast  interprocessor  communication  capability  cannot  be  expected 
to  always  overcome  the  effects  of  a  poor  decomposition. 

1  he  number  of  processors  over  which  an  expert  system’s  tasks  are  decomposed  also 
affects  communication  overhead.  A  phenomenon  called  the  Amdahl  effect  dictates  that  any 
parallel  algorithm  shows  constrained  speedup  if  there  is  not  enough  work  to  be  done  by  the 
number  of  processors  available  (34:60).  The  Amdahl  effect  suggests  there  exists  an  optimal 
number  of  processors  upon  which  a  parallel  program  can  be  run.  Applying  more  than  this 
optimal  number  of  processors  adds  communication  overhead  that  overcomes  some  of  the 
computation  time  savings  of  parallelizing  the  system.  Figure  2.3  shows  how  processing  can 
slow  down  when  more  than  the  optimal  number  of  processors  are  applied  to  a  problem. 
Proponents  of  parallel  processing  hasten  to  point  out  that  Amdahl’s  effect  occurs  under 
the  assumption  that  some  number  of  necessarily  sequential  operations  exist  to  interrupt 
the  parallel  execution  of  an  algorithm.  Hence,  Amdahl's  argument  serves  as  a  way  of 
determining  whether  an  algorithm  is  a  good  candidate  for  parallelization,  rather  than  as 
a  provable  limit  to  speedup  for  all  algorithms  (34:19). 
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2.5  Load  Imbalance 


An  imbalance  of  the  task  load  among  the  processors  also  severely  degrades  perfor¬ 
mance.  The  goal  is  to  ensure  that  "the  tasks  being  executed  [are]  uniformly  distributed 
amongst  the  various  processing  nodes  in  a  manner  which  maximizes  resource  utilization  to 
enhance  the  total  throughput  of  the  system."  (2:189)  Achieving  load  balance,  like  achiev¬ 
ing  low  communication  overhead,  also  depends  on  choosing  a  good  problem  decomposition. 
A  good  decomposition  algorithm  maps  the  expert  system  tasks  to  the  available  processors 
in  such  a  way  as  to  keep  as  many  of  the  processors  busy  doing  useful  work  as  possible. 

There  are  two  general  task  allocation  policies:  static  decomposition  and  dynamic 
decomposition.  Static  decomposition  assumes  that  tasks  and  their  precedence  relations 
are  known  before  execution.  Dynamic  decomposition  assumes  that  tasks  are  generated 
during  program  execution.  The  advantage  of  static  decomposition  is  that  it  allows  the 
prealloration  of  tasks  to  processors,  thus  reducing  the  amount  of  interprocessor  commu¬ 
nication.  The  advantage  of  dynamic  decomposition  is  that  it  makes  it  easier  to  keep  all 
the  nroressors  busy  because  tasks  requiring  processing  are  assigned  to  the  first  available 
processor  ( 34:62).  But  dynamic  decomposition  adds  communication  overhead  to  distribute 
tasks  to  available  processors. 

2.6  Summary 

A  production  system  consists  of  a  global  database  of  facts  and  a  set  of  production 
rules  (constituting  the  knowledge  base),  and  a  control  system  (inference  engine).  The  basic 
production  system  execution  algorithm  is  the  Match-Select-Act  cycle.  The  rule  LHSs  are 
matched  against  the  current  farts  in  the  knowledge  database,  one  of  the  satisfied  rules  is 
selected  for  firing,  and  the  It  1 1 S  actions  of  the  selected  rule  are  performed  to  update  the  fact 
database.  An  expert  system  applies  the  production  system  paradigm  to  problem  solving 
using  knowledge  that  is  specific  to  a  particular  domain.  Parallel  processing,  which  means 
applying  several  processors  to  the  solution  of  a  single  problem,  is  employed  to  speed  up 
processing  of  a  wide  range  of  applications,  including  expert  systems.  Two  main  obstacles  to 
realizing  speedup  through  parallelization  are  communication  overhead  and  load  imbalance. 


III.  Production  System  Performance  Improvement  Concepts 


I 

This  chapter  offers  some  production  system  parallelism  concepts  and  introduces  the 
serial  Rete  match  algorithm.  The  parallelization  of  the  Rete  algorithm  lays  the  groundwork 
for  a  new  distributed  processor  parallel  RAY  expert  system  design. 

3.1  Potential  Production  System  Parallelism 

As  discussed  in  Chapter  Ii.  the  three  steps  that  are  repeatedly  performed  to  exe¬ 
cute  a  production  system  algorithm  are  match,  select,  and  act.  Figure  3.1  illustrates  the 
information  flow  among  these  three  steps.  Note  that  a  synchronization  point  exists  af¬ 
ter  the  select  step  and  before  the  subsequent  act  step.  This  synchronization  point  has 
a  serializing  effect  on  the  match-select-act  cycle.  The  select  step  must  finish  completely 
before  the  next  production  rule  to  fir*'  can  be  determined  and  its  RHS  evaluated.  Without 
this  synchronization,  a  potential  race  condition  exists  in  which  the  \VM  change  inputs 
to  a  match-select-act  cycle  may  be  corrupted  by  outputs  of  that  same  cycle.  No  other 
&  mandatory  synchronization  points  exist  in  the  cycle. 


Figure  3.1.  Match-Select- Act  Flow  Graph  (14:46) 


This  feature  of  production  systems  suggests  areas  of  potential  parallelism.  For  ex¬ 
ample,  it  is  possible  to  use  parallelism  within  the  match  step,  within  the  select  step,  and 
within  the  act  step.  It  is  further  possible  to  overlap  the  processing  performed  within  the 
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match  step  and  the  select  step  of  the  same  cycle  and  the  processing  performed  within  the 
act  step  of  one  cycle  and  the  match  step  of  the  next  cycle  (14:45).  These  potential  inter¬ 
step  overlaps  are  represented  by  the  queue  symbols  in  Figure  3.1.  But  as  noted  above,  it  is 
not  possible  to  overlap  tie'  processing  within  the  select  step  of  one  cycle  and  the  subsequent 
act  step. 

,'>.2  Rfti  Match  Algorithm 

All  of  the  information  presented  in  this  section  is  taken  from  ( 15)  except  items  noted 
from  other  sources. 

The  most  time  consuming  step  in  the  execution  of  production  systems  is  the  match. 
Matching  is  a  pattern  recognition  activity  which  involves  matching  the  left  hand  sides 
il.HSs)  of  all  productions  against  all  facts  in  working  memory  (WM).  Even  with  special¬ 
ized  algorithms,  the  match  step  constitutes  around  90Sf  of  the  interpretation  time  (13:4). 
Consequently,  speeding  up  the  match  step  in  production  systems  is  an  aria  of  intense 
research . 

The  Ret i >  (pronounced  "reel")  algorithm  is  among  the  most  efficient  algorithms  for 
match  yet  developed .  To  achieve  t  his  efficiency,  Rete  exploits  two  features  common  to  most 
production  systems:  first .  only  a  small  fraction  of  working  memory  typically  changes  every 
match  select  act  cycle;  and.  second,  similar  condition  elements  often  appear  repeatedly 
among  the  productions  in  production  systems.  Rete  exploits  the  first  feature  by  storing 
reMilts  of  match  from  previous  cycles  and  using  them  in  subsequent  cycles.  Rete  exploits 
the  second  feature  by  recognizing  condition  elements  referenced  in  multiple  productions 
and  performing  common  tests  only  once  ( 13:4). 

The  Rete  algorithm  uses  an  augmented  discrimination  network  compiled  from  the 
I.HSs  of  the  productions  to  perform  the  match.  In  fact,  the  name  “Rete"  comes  from  the 
latin  word  fc r  “network."  (29)  Figure  3.2  shows  such  a  network  for  productions  pi  and  p2 
which  appear  in  the  top  part  of  the  figure.  In  this  figure,  lines  have  been  drawn  between 
nodes  to  indicate  the  paths  along  which  information  flows.  Information  flows  from  the  top 
node  down  along  the-c  paths. 


To  generate  the  network  for  a  particular  expert  system  application,  the  production 
rules  are  parsed  and  "t i undated”  by  a  network  compiltr.  Note  that  the  Rete  network 
compiler  is  not  a  compiler  in  the  conventional  sense  of  computing.  The  Rete  compiler 
builds  a  network  which  serves  as  the  data  structure  acted  upon  by  the  inference  engine 
during  execution  of  the  expert  system. 

The  network  compiler  proceeds  first  with  the  individual  condition  elements  in  the 
rules'  I.llSs.  for  each  condition  element,  the  compiler  chains  together  test  nodes  that 
check  the  following: 

•  if  the  attribute-,  in  the  condition  element  that  have  a  constant  as  their  value  are 
sat  isfied 

•  if  the  attributes  in  t  he  condition  element  that  are  related  to  a  constant  by  a  predicate 
are  satisfied 

•  if  two  occurrences  of  the  same  variable  within  the  condition  element  are  consistently 
bound  (i.e.  working  memory  elements  with  the  same  value  for  the  specified  attribute 

exist ) 

liach  node  in  the  chain  performs  one  such  test.  These  three  tests  are  called  intra- 
ffiudihou  tests  because  they  correspond  to  individual  condition  elements.  In  Figure  3.2, 
the  nodes  with  a  single  predecessor  (near  the  top  of  the  network)  are  the  ones  that  are 
concerned  with  individual  condition  elements. 

Once  the  network  compiler  has  finished  with  the  individual  condition  elements,  it 
adds  nodes  that  check  for  consistency  of  variable  bindings  across  the  multiple  condition 
elements  in  the  LHS.  These  tests  are  called  intcr-condition  tests  because  they  refer  to 
multiple  condition  elements.  The  nodes  with  two  predecessors  are  the  ones  that  check  for 
consistency  of  variable  bindings  between  condition  elements. 

Finally,  the  compiler  adds  a  special  terminal  node  to  represent  the  successful  match¬ 
ing  of  the  production  to  which  this  part  of  the  network  corresponds.  The  terminal  nodes 
are  at  the  bottom  of  the  figure. 


Note  that  when  two  I.IISs  require  identical  nodes,  the  compiler  builds  a  shared  set 
of  nodes  in  the  network  rather  than  duplicate  nodes.  This  feature  of  Rete  ensures  that  the 
same  test  is  not  performed  repeatedly  for  multiple  rules  within  a  single  match  step. 

To  avoid  performing  all  of  the  same  tests  completed  during  the  previous  match  step, 
the  Rete  algorithm  stores  the  result  of  a  match  with  working  memory  as  state  within  the 
network.  Only  changes  made  to  the  working  memory  by  the  most  recent  production  firing 
have  to  be  processed  every  match-select-act  cycle.  That  is,  the  input  to  the  Rete  network 
consists  of  the  most  recent  changes  to  the  working  memory.  These  changes  filter  through 
the  network  and.  where  relevant,  the  state  stored  in  the  network  is  updated.  Due  to 
the  Rete  algorithm's  state-saving  feature,  the  amount  of  effort  expended  by  the  matcher 
depends  primarily  on  the  rate  of  change  of  working  memory  rather  than  the  absolute 
size  of  working  memory  (11:37).  The  output  of  the  network  consists  of  a  specification 
of  changes  to  the  list  of  rules  eligible  to  be  fired.  This  list  of  rules  is  called  the  conflict 
set  because  only  one  of  these  rules  may  be  allowed  to  fire  under  the  production  system 
paradigm.  Consequently,  the  rules  can  be  said  to  be  in  conflict  over  the  right  to  be  fired 
in  the  current  cycle. 

The  objects  that  are  passed  between  nodes  in  the  network  are  called  tokens ,  which 
consist  of  a  tag  and  a  list  of  working  memory  elements.  The  tag  can  be  either  a  -f, 
indicating  that  an  element  has  been  added  to  the  working  memory,  or  a  -  .  indicating  that 
an  element  has  been  deleted  from  working  memory.  No  special  tag  for  working  memory 
element  modification  is  needed  because  a  modify  is  treated  as  a  delete  followed  by  an 
add.  The  list  of  working  memory  elements  associated  with  a  token  corresponds  to  the 
permutation  of  those  elements  that  the  system  is  trying  to  match  or  has  already  matched 
against  a  subsequence  of  condition  elements  in  the  LHS. 

The  discrimination  network  produced  by  the  Rete  network  compiler  consists  of  a 
number  of  the  following  types  of  nodes: 

•  Root  Node:  This  node  forms  the  root  of  the  discrimination  net.  It  broadcasts  tokens 
corresponding  to  any  change  in  the  working  memory  to  all  its  successor  nodes.  In 
Figure  3.2,  the  root  node  is  shown  at  tlk  top. 
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•  Constant  Tfst  (t -const)  Nodes:  These  nodes  are  used  in  the  network  to  perform 
intra-condition  tests,  for  example,  to  check  if  condition  attributes  that  have  constant 
symbols  or  numbers  as  their  values  are  satisfied.  Each  t-const  node  checks  for  one 
feature.  Whenever  the  token  arriving  at  the  input  of  a  t-const  node  satisfies  the 
associated  test,  it  is  passed  on  V  the  successors  of  the  t-const  node.  If  the  token 
does  not  satisfy  the  test,  it  is  not  passed  on  to  the  successors.  In  Figure  3.2,  the 
nodes  towards  the  top  of  t lie  network  are  t-const  nodes.  Because  the  second  condition 
element  of  production  pi  is  similar  to  the  first  condition  element  of  production  p2, 
t-const  nodes  “Class=C2"  and  “attrl  =  15"  are  shared  in  the  network  for  rules  pi  and 

p2. 

•  Alpha  Memory  (alpha-mem)  Nodes:  If  a  working  memory  element  satisfies  all  intra- 
condition  tests  for  a  condition  element,  the  working  memory  element  is  said  to  par¬ 
tially  match  the  condition  element.  Note  that  it  may  not,  as  yet,  satisfy  all  the 
inter-condition  tests.  Tokens  corresponding  to  working  memory  elements  that  par¬ 
tially  match  a  condition  element  are  stored  in  the  alpha-mem  node  for  that  condition 
element.  When  a  token  arrives  at  an  alpha-mem  node  with  a  +  tag,  the  token  is 
stored  in  the  alpha-mem  node  and  a  copy  of  the  token  is  passed  to  the  node's  suc¬ 
cessors.  If  the  tag  is  -.  a  corresponding  token  with  a  +  tag  must  already  exist  in  the 
alpha-mem.  The  corresponding  +  token  is  deleted  from  the  alpha-mem  node  and 
the  incoming  token  is  passed  down  to  the  successors  of  the  alpha-mem  node.  If  two 
condition  elements  in  the  same  or  different  productions  have  exactly  the  same  tests 
for  a  successful  partial  match,  the  network  compiler  generates  a  shared  alpha-mem 
node  for  the  two.  This  sharing  of  an  alpha-mem  node  can  be  seen  in  Figure  3.2. 

•  Beta  Memory  (be la-mem)  Nodes:  Ji  ot  as  alpha-mem  nodes  store  tokens  that  partially 
match  individual  condition  elements,  so  beta-mem  nodes  store  tokens  that  partially 
match  two  or  more  condition  elements  in  the  LHS  of  a  production.  The  list  of  working 
memory  elements  in  beta-mem  tokens  has  length  two  or  more.  The  response  of  beta- 
mem  nodes  to  the  arrival  of  tokens  at  their  inputs  is  exactly  the  same  as  that  of 
alpha-mem  nodes.  The  beta-mem  nodes  form  the  left  input  of  and-nodes  and  not- 
nodes. 
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•  And- Nodes:  The  and- nodes  are  the  first  of  the  two-input  node  types.  The  primary 
function  of  an  and-node  is  to  check  for  consistency  of  variable  bindings  between  the 
partially  matched  tokens  it  receives  on  its  left  and  right  inputs.  The  right  input  of 
an  and-node  always  comes  from  an  alpha-mem  node,  while  its  left  input  can  come 
from  an  alpha-mem  or  a  beta-mem  node.  Whenever  a  token  arrives  at  the  left  input 
of  an  and-node.  the  and-nod«.  compares  the  incoming  token  to  each  token  stored 
in  the  mem-node  connected  to  its  right  input  to  check  if  they  are  consistent.  For 
every  right-token  which  is  consistent  with  the  left-token,  a  new  token  is  constructed 
and  sent  down  to  the  successor  nodes.  The  new  token  has  the  same  tag  as  that  of 
the  left-token,  and  the  list  of  working  memory  elements  is  the  concatenation  of  the 
working  memory  element  lists  for  the  left  and  right  tokens.  The  case  when  a  token 
arrives  at  the  right  input  of  an  and  node  is  processed  exactly  as  above,  with  left  and 
right  interchanged. 

•  Xot-.Xodr  The  not -nodes  are  the  second  of  the  two-input  node  types.  They  also 
have  a  left  and  a  right  input.  The  not-nodes  are  used  by  the  network  to  implement 
negated  condition  elements.  Their  functionality  differs  from  that  of  and-nodes  only 
in  minor  ways.  One  difference  is  that  not-nodes  keep  reference  counts  with  tokens  in 
left  memory  to  find  when  there  are  no  tokens  in  the  right  memory  that  are  consistent 
with  them. 

•  fXoduction  Nodts  (p-nodrs):  These  are  the  terminal  nodes  in  the  network.  There  is 
one  such  node  associated  with  each  production.  Whenever  a  token  with  a  +  tag  flows 
into  a  p-node,  it  adds  an  instantiation  (corresponding  to  the  token)  of  the  associated 
production  into  the  confict  set.  The  arrival  of  a  token  with  a  -  tag  leads  to  the 
deletion  of  the  corresponding  production  instantiation  from  the  conflict  set. 

•  Other  Nodes:  Other  than  the  node  types  mentioned  above,  the  network  uses  two 
more  node  types.  These  are  the  Two-Nodes  and  the  Any-Nodes.  The  two-node  is 
used  as  a  place  filler  in  some  circumstances,  and  the  any-node  is  used  when  the  value 
of  an  attribute  is  to  be  one  of  a  number  of  alternatives. 

The  match  step  in  a  Kete  network  interpreter  can  itself  be  divided  into  two  parts:  the 
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selection  phase ,  which  consists  of  evaluating  the  t-const  nodes;  and  the  state -ujxlale  phase , 
which  consists  of  evaluating  the  alpha-mem  nodes,  beta-mem  nodes,  and-nodes,  riot-nodes, 
and  p-nodes.  Comparing  these  two  phases,  about  75%;  to  95%  of  the  total  processing  time 
is  spent  performing  the  state-update  phase  (14:47). 


3.3  Parallelizinq  Rcte 

The  Rete  match  algorithm  is  suitable  for  parallel  implementations.  The  data-flow 
like  organization  of  the  Rete  network  makes  it  possible  to  evaluate  the  activations  of  dif¬ 
ferent  nodes  in  the  network  in  parallel.  It  is  also  possible  to  evaluate  multiple  activations 
of  the  same  node  in  parallel  and  to  process  multiple  changes  to  working  memory  in  parallel 
( 14:20).  Of  the  many  sources  of  production  system  parallelism,  the  following  three  arc  par¬ 
ticularly  important  in  the  parallelization  of  Rete  and,  specifically,  of  the  state-update  phase 
of  the  Rete  match  step:  production  parallelism,  node  parallelism ,  and  action  parallelism. 

Production  parallelism  is  accomplished  by  dividing  the  productions  in  a  program 
into  several  partitions  and  performing  the  match  for  each  of  the  partitions  in  parallel. 
Figure  3.3  illustrates  the  partitioning  of  productions  of  an  expert  system.  Production 
partitioning  is  a  static  task  decomposition  approach.  Consequently,  the  main  advantage  of 
using  production  parallelism  is  that  no  communication  is  required  between  the  processes 
performing  match  for  different  productions  or  different  partitions.  That  is,  it  is  large- 
grain  parallelism.  Disadvantages  of  production  parallelism  are  that  it  is  limited  by  1)  the 
typically  small  number  of  productions  affected  per  change  to  working  memory,  2)  the  large 
variance  in  the  amount  of  processing  required  by  the  affected  productions,  and  3)  the  loss 
of  sharing  in  the  overall  Rete  network  as  a  result  of  production  partitioning  (14:48-49). 

Node  parallelism ,  which  is  unique  to  the  Rete  algorithm,  means  that  activations  of 
different  two-input  nodes  in  the  Rete  network  are  evaluated  in  parallel.  An  advantage  of 
node  parallelism  is  that  t-oth  activations  of  two-input  nodes  belonging  to  different  produc¬ 
tions  (corresponding  to  production  parallism)  and  activations  of  two-input  nodes  belonging 
to  the  same  production  (resulting  in  extra  parallelism)  are  processed  in  parallel.  Node  par¬ 
allelism  is  implemented  at.  a  finer  granularity  than  production  parallelism  to  1)  reduce  the 
effect  of  large  variance  in  the  amount  of  affected  productions  processing,  and  2)  to  recover 
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Figure  3.3.  Production  Parallelism  (17:48) 


some  of  the  sharing  lost  in  the  overall  Rete  network  when  using  production  parallelism. 
This  fine  granularity,  however,  leads  to  increased  communication  requirements  between 
processes  evaluating  the  nodes  in  parallel  (14:51). 

Action  parallelism  refers  to  the  concurrent  processing  of  changes  made  to  working 
memory  when  a  production  fires.  Action  parallelism  enhances  the  speedup  obtainable 
using  production,  node,  and  other  forms  of  parallelism  (14:54). 

A  significant  amount  of  research  has  been  performed  toward  implementing  paral¬ 
lel  Rete  on  multiprocessors  (14,  25).  Fine-grain  node  parallelism,  enhanced  with  action 
parallelism  wherever  possible,  is  the  preferred  implementation  method  when  using  a  mul¬ 
tiprocessor  architecture.  This  method  is  attractive  because  multiprocessors  have  the  ad¬ 
vantage  of  shared  memories  (see  Appendix  B),  which  offsets  much  of  the  cost  of  increased 
communication  associated  with  node  parallelism  (14:58). 

In  a  typical  multiprocessor  design,  a  single  copy  of  the  Rete  network  is  held  in  shared 
memory.  The  match  is  broken  into  fairly  small  units  of  work  called  tasks,  where  a  task 
is  an  independently  schedulable  unit  of  work  that  may  be  executed  in  parallel  with  o*her 
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tasks  (17:103).  Each  task  is  represented  by  a  token.  This  token  is  essentially  the  same 
as  that  described  for  the  sequential  Rete  matcher,  except  that  it  has  two  extra  items  of 
information:  the  addrrss  of  the  node  to  which  the  token  is  to  be  sent;  and.  if  that  node 
is  a  two-input  node,  an  indication  of  whether  to  send  it  to  the  left  or  right  input.  The 
list  of  tokens  that  are  awaiting  processing  is  held  in  a  central  data  structure  called  a  task 
qui  lie.  When  a  processor  in  the  multiprocessor  becomes  available,  it  removes  a  token  from 
the  task  queue.  If,  during  processing  of  a  token,  new  tokens  are  to  be  sent  out.  these 
are  entered  into  the  task  queue  for  subsequent  processing.  See  (17)  for  details  of  a  highly 
successful  parallel  production  system  implementation  employing  Rete  on  a  shared-memory 
multiprocessor. 

Unlike  the  parallel  Rete  research  performed  on  multiprocessors,  implementation  of  a 
parallel  Rete  matcher  on  a  multicomputer  architecture  remains  relatively  unexplored  for 
two  basic  reasons.  First,  the  most  natural  approach  to  implementing  production  systems 
on  a  multicomputer  is  production  parallelism,  enhanced  with  action  parallelism.  Hut 
preliminary  simulation  analysis  of  parallel  processing  using  only  production  parallelism  and 
action  parallelism  indicates  that  the  speedups  attainable  are  very  low  (32:1)2).  Second,  the 
node  parallelism  approach  was  shown  theoretically  and  through  simulation  to  be  superior  to 
production  parallelism.  But  the  fine  granularity  of  node  parallelism  adds  communications 
costs  that  may  restrict  the  class  of  suitable  architectures  to  shared-memory  multiprocessors 
(Mm*). 

Limited  theoretical  analysis  of  a  parallel  architecture  which  implements  Rete  in  an 
object-oriented  manner  on  a  multicomputer  has  been  performed  (16).  This  research  draws 
heavily  from  work  performed  on  multiprocessor  designs.  At  the  time  of  this  research  investi¬ 
gation,  ilie  only  parallel  multicomputer  implementation  of  a  production  system  interpreter 
that  employs  Rete  is  II  CLIPS,  developed  for  the  FLEX/32  (MIMD)  multicomputer  (33). 

■i.\  Summary 

Production  systems  lend  themselves  to  parallel  execution.  The  only  mandatory  syn¬ 
chronization  point  in  the  match-select  act  cycle  exists  after  the  select  step  and  before  the 
subsequent  act  step.  Speeding  up  processing  of  the  match  step  is  critical,  as  it  typically 
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constitutes  around  90%  of  the  production  cycle  time.  The  Rete  match  algorithm  speeds 
match  time  in  two  ways:  first,  Rete  saves  WM  search  time  by  saving  state  between  cycles, 
a  feature  that  takes  advantage  of  the  small  fraction  of  WM  typically  changed  per  cycle; 
second,  Rete  recognizes  condition  element  references  common  to  multiple  rules  and  exe¬ 
cutes  common  tests  only  once.  The  data-flow  like  organization  of  the  Rete  netw’ork  makes 
it  suitable  for  parallel  implementations  to  take  particular  advantage  of  production  par¬ 
allelism,  node  parallelism,  and  act  parallelism  during  execution.  Parallel  multicomputer 
implementation  of  a  production  system  interpreter  that  employs  Rete  is  an  area  of  research 
still  in  its  infancy. 
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l\  .  Research  Methodology 


This  chapter  presents  and  justifies  the  research  methodology  applied  to  parallelization 
of  the  RAV  expert  system  during  this  research  investigation.  The  following  are  the  restated 
and  expanded  goals  of  this  research: 

•  Present  the  fastest  processing  of  the  RAV  expert  system  achieved  to-date  using  the 
current  state-of-the-art  parallel  design  and  determine  this  design's  expected  perfor¬ 
mance  if  implemented  on  the  iPSC/2. 

•  Design  and  implement  a  new  parallel  design  on  the  iPSC/2  using  a  more  efficient 
and  effective  match-select-act  algorithm  (Rete)  to  achieve  speedup  over  the  current 
state-of-the-art  design. 

•  Ensure  valid  performance  comparisons  between  these  parallel  designs  by  considering 
implementations  of  the  designs  that  use  the  same  tools  (e.g.  hardware,  language) 
and  the  same  input  data  wherever  possible. 

•  Determine  the  relative  usefulness  of  these  parallel  designs  by  showing  their  relation 
to  the  expected  lower  and  upper  bounds  of  parallel  processing  performance  and  to 
the  desired  “real-time”  performance  of  the  RAV  expert  system. 

4-1  Justification  of  Method  Selected 

It  is  common  practice  in  parallel  computer  architecture  research  to  compare  the 
performance  of  one's  ncu'  design  with  that  of  the  current  state-of-the-art  design  applied  to 
the  same  problem.  Speed  and  correctness  of  processing  are  the  key  performance  criteria 
analyzed.  The  comparison  of  designs  is  necessary  to  show  the  advancement  of  knowledge 
in  the  field  of  application.  Hut  this  approach  by  itself  offers  only  a  very  limited  analysis 
of  the  merit  of  the  new  design  for  two  main  reasons. 

First,  the  new'  parallel  design  is  often  configured  to  run  on  a  different  machine  than 
that  used  by  the  previous  parallel  design.  Consequently,  the  performance  speedups  at¬ 
tributable  to  the  different  hardware  are  not  distinguishable  from  the  speedups  attributable 
solely  to  the  new  program  design.  It  is  imperative  that  architectures  being  compared  use 


4-1 


identical,  or  at  least  very  similar,  hardware  to  isolate  the  performance  differences  at¬ 
tributable  to  program  design. 

Second,  the  true  merit  of  the  current  state-of-the-art  parallel  design  is  often  a  mys¬ 
tery.  This  is  especially  true  when  current  research  into  a  particular  area  of  application  is 
relatively  immature.  In  this  case,  it  is  not  enough  that  a  new  parallel  design  outperforms 
the  previous  “best”  design,  because  both  designs’  performances  may  still  fall  far  short  of 
the  theoretical  performance  potential  for  such  an  application. 

4-2  Performance  Spectrum 

To  determine  the  true  merit  of  a  parallel  design  for  a  particular  application,  one  must 
determine  where  on  a  performance  spectrum  this  parallel  design  lies  in  terms  of  processing 
speed.  The  processing  speed  is  defined  in  terms  that  are  significant  to  the  particular  appli¬ 
cation  (e.g..  for  expert  systems,  the  performance  metric  of  interest  is  typically  the  average 
number  of  rules  fired  per  second).  Two  logical  and  essential  metrics  on  the  performance 
spectrum  are  a  good  serial  design's  processing  speed  and  the  required,  or  goal,  processing 
speed. 

Performance  data  on  a  serial,  or  single-processor,  design  for  an  application  is  often 
available  to  the  researcher.  In  fact,  it  is  sometimes  the  failure  of  a  serial  approach  to 
solve  a  problem  in  what  the  user  defines  as  “real-time”  that  leads  to  attempts  at  parallel 
solutions.  Although  the  “goodness”  of  a  particular  serial  design  is  difficult  to  quantify, 
still  the  performance  of  some  serial  design  is  useful  to  quantify  the  performance  payback 
realized  by  parallelizing  a  solution  in  the  new  design.  The  serial  design’s  performance 
serves  as  the  lower  bound  on  the  performance  spectrum. 

The  other  key  metric  on  the  performance  spectrum  is  the  goal  processing  speed.  The 
maturity  level  of  computer  architecture  research  with  respect  to  a  particular  application 
can  only  be  determined  by  comparing  the  performance  of  a  new  design  with  the  goal 
performance.  For  example,  the  “real-time”  performance  requirement  is  the  goal  driving 
most  research  in  parallel  processing  of  expert  systems.  The  real-time  metric  defines  the 
speed  at  which  computational  results  must  be  produced  for  the  particular  application  to 
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be  successful. 


An  additional  metric  on  the  performance  spectrum  is  needed  to  show  if  it  is  rea¬ 
sonable  for  the  researcher  to  expect  to  meet  the  performance  goal  within  the  limitations 
of  research  assumptions.  For  example,  although  processing  speed  comparisons  among  a 
serial  design,  a  state-of-the-art  design,  and  a  new  design  may  all  show  progress  toward 
the  goal  processing  speed,  the  hardware  of  choice  may  not  be  physically  capable  of  ever 
actually  achieving  the  goal  performance.  Consequently,  there  exists  a  research  need  for 
an  trpper  bound  performance  metric  under  given  application  and  architecture  constraints. 
Of  course,  because  the  upper  bound  is  not  readily  attainable  like  the  other  performance 
spectrum  metrics,  theoretical  and/or  simulation  methods  must  be  applied. 

Figure  4.1  is  an  example  performance  spectrum  chart  illustrating  the  above  metrics 
and  their  interaction  for  a  fictional  application.  Note  that  interpretation  of  this  figure 
suggests  the  following: 

1.  The  performance  of  the  serial  design  applied  to  this  problem  falls  far  short  of  the  goal 
performance.  This  poor  serial  design  performance  leads  the  researcher  to  consider 
the  possibility  of  attempting  a  parallel  solution. 

2.  I  he  upper  bound  performance  metric  suggests  that  the  architectural  approach  being 
taken,  under  ideal  parallel  conditions,  has  tne  potential  to  achieve  the  goal  perfor¬ 
mance.  The  upper  bound  performance  findings  encourage  continued  research  on  the 
current  architecture. 

3.  The  state-of-the-art  design  realizes  only  limited  performance  improvement  over  the 
serial  design.  This  limited  performance  improvement  suggests  to  the  researcher  that 
perhaps  a  whole  new  design  approach,  rather  than  enhancements  to  the  previous 
design,  should  be  pursued. 

4.  The  new  design  performance  produces  significant  speedup  over  the  serial  design  and 
approaches  the  goal  performance,  lending  merit  to  the  research  contribution  of  this 

new  design. 


Processing  Speed 
(in  units  of  interest 


Legend:  1  -  lower  bound  performance  metric 

s  -  current  state-of-the-art  parallel  performance 
p  -  proposed  parallel  architecture  performance 
g  -  goal  performance 

u  -  upper  bound  on  proposed  architecture  performance 


Figure  4.1.  Example  Performance  Spectrum  Chart 
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4-3  Research  hive  stigatum  Sltps 

Based  on  the  methodology  detailed  above,  this  research  is  comprised  of  the  following 

steps: 

•  Step  1:  Determine  the  lou-(r  bound  performance  expected  in  processing  the  RAY 
expert  system.  To  do  this,  a  “good"  serial  design  is  implemented  on  the  iPSC/2 
configured  to  use  only  one  of  its  processors.  The  performance  metric  observed  during 
the  processing  of  the  RAY  expert  system  on  a  single  processor  is  recorded  as  the 
lower  bound  (worst  case)  performance  metric.  This  metric  is  needed  to  determine 
the  speedup  achieved  by  any  of  the  parallel  designs  (e.g.  Speedup  =  Time  for  Parallel 
Processors  to  Compute  Result  /  Time  for  Single  Processor  to  Compute  Result). 

•  Step  2:  Determine  the  t/p/xr  bound  performance  expected  in  parallel  processing  of  the 
RAY  expert  system.  The  upper  bound  is  needed  to  estimate  the  maximum  speedup 
achievable  under  a  given  parallel  design.  The  theoretical  maximum  speedup  of  \ 
(where  N  is  the  number  of  parallel  processors)  requires  the  very  unlikely  pairing  of 
perfect  load  balance  and  no  communication  among  processors.  A  more  realistic  esti¬ 
mate  of  the  minimal  communication  and  optima]  load  balance  achievable  is  developed 
theoretically.  Lamanna's  hypercube  model  is  adjusted  to  describe  the  performance 
of  the  RAY  expert  system  over  several  iPSC/2  configurations  of  differing  numbers  of 
parallel  processors  (2G:4). 

•  Step  Analyze  theoretically  the  expected  performance  of  the  current  state -of-tlu  -art 
parallel  architecture  as  implemented  on  the  )PSC/2.  This  analysis  entails  consider¬ 
ing  the  likely  effects  on  the  Shakley  design’s  performance  due  to  any  upgrades  added 
to  date  to  t fie  iJ\SC/2  as  compared  to  the  iPSC/1.  Analysis  results  for  the  Shak- 
ley  design  over  several  configurat  ions  of  differing  numbers  of  parallel  processors  are 
recorded. 

•  Step  4:  Design,  implement,  and  analyze  the  performance  speedup  (if  any)  of  a  ncir 
parallel  architecture  on  the  iPSC/2  using  a  better  match-select-act  algorithm  and  an 
appropriate  decomposition  algorithm.  This  design  is  exercised  over  several  iPSC/2 


configurations  employing  different  numbers  of  parallel  processors  and  the  perfor¬ 
mance  metrics  are  recorded. 

•  Step  5:  Compare  the  pertonnance  results  produced  in  previous  steps  with  respect 
to  the  RAV  nal-tinif  rt  quin  mail.  AFWAI.  project  managers  are  interviewed  to  de¬ 
termine  how  fast  the  RAV  expert  system  must  be  processed  in  order  for  the  RAV 
to  be  feasible  (e.g  how  fast  is  “real-time”?).  This  real-time'  performance  metric  is 
plotted  in  relation  to  the  plotted  performance  metrics  determined  in  Steps  1  through 
•1.  lysis  of  tin-  performance  metrics  shows  whether  any  one  of  t lit'  actual  designs 
implemented  on  the  il'SC/2  meets  the  RAV  real-time  performance  requirement,  o, 
whether  the  simulated  model  of  ;.n  optimal  design  can  support  the  RAV  expert  sys¬ 
tem  in  real-time.  In  the  latter  case,  such  findings  suggest  how  many  ii’SC/2  like 
processors  need  to  be  applied  to  achieve  real-time  processing  of  the  RAV  expert 
system  given  minimal  communication  overhead  and  perfect  load  balance.  This  infor¬ 
mation  could  add  insight  into  the  feasibility  of  the  RAV  project. 

j../  Statistical  Proculun  s 

All  RAV  expert  system  designs  operate  on  the  same  data  sets.  The  designs  arc' 
validated  by  direct  inspection  of  the  results  and  of  performance'  metrics  compiled  during 
expert  system  processing. 

I  lie  expert  system  performance  metric  of  particular  interest  is  processing  speed  at¬ 
tained.  defined  in  terms  of  the  average  number  of  rules  fired  per  second,  'l  iming  data  are 
also  collected  on  the  average  time  spent  in  each  of  the  match,  select,  and  act  steps  during 
processing. 

j .  ■")  Summary 

The  research  methodology  applied  in  this  investigation  stresses  not  only  the  develop¬ 
ment  of  a  new  parallel  architecture  for  the  RAV  expert  system  but  also  the  quantification 
of  this  design’s  contribution  to  RAV  expert  system  research.  The  following  is  a  summary 
of  the  RAV  expert  system  research  investigation  stops: 


•  Step  1:  Determine  the  lower  hound  performance  metric  realized  by  a  good  serial 
RAV  expert  system  design. 

•  Step  2:  Determine  theoretically  the  tipper  bound  performance  metrics  achievable  by 
a  parallel  design  implemented  on  the  chosen  architect ure. 

•  Step  3:  Determine  theoretically  the  current  state-of-the-art  parallel  design's  perfor¬ 
mance  metrics  realized  when  running  on  the  chosen  architecture. 

•  Step  1:  Determine  the  actual  performance  metrics  realized  by  a  new  parallel  design 
implemented  on  the  chosen  architecture. 

•  Step  5:  Compare  these  performance  metrics  to  the  real-time  performance  require¬ 
ment  . 
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\  .  Step  1:  Lower  Bound  Performance 


5. 1  System  Design 

To  define  the  processing  speedup  attributable  to  parallel  processing  of  a  program,  one 
must  first  establish  the  processing  speed  of  a  “good"  serial  design  as  a  base  of  comparison 
(e.g.  Speedup  =  Parallel  Processing  Time  /  Serial  Processing  Time),  for  this  research, 
the  processing  speed  of  a  "good"  serial  design  is  needed  to  delineate  a  lower  bound  of 
performance  expected  in  processing  the  RAV  expert  system.  Furthermore,  the  serial  design 
must  be  implemented  on  hardware  similar  to  the  hardware  upon  which  subsequent  parallel 
designs  are  implemented  if  comparison  of  their  processing  speeds  is  to  be  valid. 

The  proposed  serial  design  employs  an  existing  serial  expert  system  interpreter,  or 
shell,  which  uses  the  Rete  algorithm  when  performing  the  match-select-act  cycle  on  any 
input  set  of  rules  and  initial  facts.  The  design  decision  to  use  an  existing  shell  to  support 
the  RAY  expert  system  allows  this  research  investigation  to  take  full  advantage  of  the 
Rete  algorithm  optimization  efforts  afforded  during  the  implementation  of  the  expert  sys¬ 
tem  shell.  Expert  system  shells  considered  as  alternatives  for  this  implementation  include 
Inference  Corporation’s  Automated  Reasoning  Tool  (ART)  (1).  Carnegie-Mellon  Univer¬ 
sity's  OPS5  and  paraOPSo  (9),  and  NASA’s  C-Language  Integrated  Production  System 
(CLIPS)  (5). 

The  original  RAY  expert  system  was  developed  using  ART  (see  Appendix  A),  mak¬ 
ing  this  expert  system  shell  a  good  candidate  from  an  RAY  knowledge-base  portability 
perspective.  But  ART  is  not  an  attractive  alternative  for  this  design  effort  for  several 
reasons.  ART  is  currently  available  only  in  Lisp-based  and  Bliss-based  versions.  A  C- 
based  interpreter  is  desired  for  this  investigation,  rather  than  a  shell  based  in  a  symbolic 
language,  for  both  program  efficiency  and  program  portability  reasons  (2(3:190).  Also,  the 
cost  of  acquiring  new  versions  of  ART  for  the  purpose  of  this  research  proved  prohibitive. 

Carnegie- Mellon's  OPSb  expert  system  shell  series  was  developed  and  optimized  by 
the  originators  of  the  pete  algorithm  (10).  But,  again,  the  early  versions  of  OPSo  are 
Lisp-based.  A  later  parallel  version,  called  paraOPS5,  can  execute  in  serial  mode  and  is 
C  based  at  a  macro  level.  But  paraOPSo  is  only  partially  C-coded,  with  the  Rete  network 


embedded  directly  in  the  National  Semiconductor  NS32032  machine  code  for  realization 
of  more  speedup  (17:96).  Neither  of  these  existing  designs  lend  themselves  to  convenient 
rehosting  into  a  serial,  fully  C’-based  OPS5  to  execute  on  the  iPSC/2's  Intel  8038G  chip. 

NASA's  C-Language  Integrated  Production  System  (CLIPS)  interpreter  is  chosen 
as  the  expert  system  sheii  for  this  research  investigation  (33:713).  As  the  shell's  name 
suggests,  the  serial  CLIPS  is  written  in  C  specifically  for  the  purposes  of  efficiency  and 
portability  (3:71).  Of  course,  to  exercise  the  RAY  expert  system  using  the  CLIPS  in¬ 
terpreter,  the  RAY  knowledge  base  (rules  and  facts)  is  transliterated  from  its  original 
Automated  Reasoning  Tool  (ART)  syntax  to  CLIPS  syntax  with  no  loss  of  functionality. 

5.2  Detailed  Design 

A  full  CLIPS  interpreter  executes  on  the  host  processor  of  the  iPSC/2.  The  source 
code  of  the  CLIPS  program,  written  in  the  C-Programming  Language,  is  compiled  without 
modification  using  the  Greenhill  C  compiler  under  the  UNIX/System  Y  operating  system. 
At  system  initialization,  the  processor  is  loaded  with  all  of  the  production  rules  in  the  RAY 
knowledge  base  from  which  to  build  a  Rete  network.  Then  the  initial  facts  are  asserted  in 
working  memory,  after  which  the  RAY  expert  system  is  ready  to  execute.  When  production 
system  execution  is  complete,  performance  data  are  collected  and  displayed  by  CLIPS  (e.g. 
rules  fired,  execution  time).  The  high-level  algorithm  employed  by  the  serial  design  is 
illustrated  in  Figure  5.1. 

Note  that  this  algorithm  is  the  same  as  the  match-select-act  cycle  algorithm  described 
in  Chapter  II.  Of  course,  no  interprocessor  communication  is  required  because  the  iPSC/2 
is  configured  as  an  SIS!)  computer  (see  Appendix  B). 

5.3  Implementation 

Because  the  expert  system  portions  of  the  RAY  constitute  the  scope  of  this  study, 
only  the  Piloting  Expert  System  (PES)  and  Vehicle  Control  Expert  System  (VCES)  are 
executed  under  CLIPS  (see  Appendix  A).  That  is,  the  conventionally  programmed  subsys¬ 
tems  of  the  prototype  RAY  design,  such  as  the  Route  Planner  and  the  Intelligent  Vehicle 
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Procedure  RETE: 

1.  do  while  (termination  state  not  detected); 

2.  match  •  update  the  Rete  network  with  VYME 

change  applied  during  the  last  cycle 

3.  select  -  select  a  production  from  conflict  set 

4.  act  -  apply  WME  change  specified  by  selected 

production’s  RHS 

5.  end  do; 

Figure  5.1.  RAV  Serial  Rete  Design 

Workstation,  are  not  present  to  provide  inputs  to  the  PES  and  VCES.  Consequently,  the 
RHSs  of  key  rules  are  altered  to  artificially  introduce  the  values  normally  produced  by 
one  or  more  of  the  missing  conventional  subsystems.  In  this  way,  the  RAV  rules  are  kept 
firing  to  simulate  progression  through  a  reasonable  air  mission.  The  benchmark  air  mission 
consists  of  the  execution  of  the  entire  RAV  takeoff  sequence  of  rules,  the  initiation  of  all 
possible  RAV  air  maneuver  rule  sequences  in  the  knowledge  base,  and  completion  of  the 
entire  RAV  landing  sequence  of  rules. 


The  initial  facts  are  asserted  into  working  memory  by  the  firing  of  a  startup  rule. 
The  startup  rule  has  no  conditions  in  its  LHS,  meaning  it  satisfied  regardless  of  the  state 
of  WM.  The  RHS  of  the  startup  rule  consists  of  a  set  of  fact  assertions  that,  when  the 
rule  is  fired,  load  all  of  the  facts  required  to  activate  the  desired  set  of  initial  RAV  rules. 
The  subsequent  RAV  rule  firings  simulate  the  guidance  of  an  aircraft  through  the  entire 
takeoff  sequence,  a  series  of  air  maneuvers,  and  the  entire  landing  sequence.  It  is  for  these 
subsequent  rule  firings  that  timing  data  are  collected.  For  this  study’s  benchmark  RAV 
execution,  a  total  of  73  RAV  rules  are  fired  in  approximately  3.5  seconds  by  the  CLIPS 
interpreter  for  an  average  of  20.9  rules  per  second.  The  entire  RAV  rule  set  consists  of  273 
rules.  Therefore,  the  73  rule  firings  observed,  representing  nearly  27%  of  the  RAV  rule 
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base,  is  considered  a  valid  number  to  show  performance  difference  among  designs. 

5.4  Summary 

NASA's  serial  C-Language  Integrated  Production  System  (CLIPS)  shell  is  used  to 
execute  the  RAY  expert  system.  The  original  ART-syntax  RAY  knowledge  base  is  translit¬ 
erated  into  CLIPS-syntax  and  is  adapted  to  run  without  external  input  to  allow  execution 
of  the  RAY  using  CLIPS.  A  full  CLIPS  interpretter  executes  on  the  host  processor  of  the 
iPSC/2  under  the  UNIX/System  Y  operating  system.  Using  this  serial  design,  an  average 
RAY  processing  rate  of  20.9  rules  fired  per  second  is  observed.  This  processing  rate  serves 
as  the  lower  bound  performance  for  execution  of  the  RAY  expert  system. 
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17,  Step  2:  Upper  Bonne i  Performance 


6.1  System  Design 

A  critical  metric  in  any  parallel  architecture  design  is  the  estimate  of  the  maximum 
speedup  achievable.  The  theoretical  maximum  speedup  of  N  (where  N  is  the  number  of 
parallel  processors)  assumes  the  unlikely  pairing  of  perfect  load  balance  and  no  communi¬ 
cation  overhead.  A  more  realistic  estimate  of  the  minimal  communication  overhead  and 
near  optimal  load  balance  achie%-able  within  a  given  design  must  be  developed  theoretically 
and/or  through  simulation. 

As  Lamanna  points  out  in  her  Performance  Study  of  the  Hypercube  Architecture, 
evaluating  the  performance  of  an  architecture  cannot  be  divorced  from  the  algorithm  used 
(2o:30).  Consequently,  the  theoretical  upper  bound  performance  metric  in  this  research 
investigation  represents  the  maximum  potential  RAV  expert  system  processing  speedup 
realizable  using  the  parallel  expert  system  algorithm  proposed  under  ideal  communication 
and  load  balance  conditions.  The  units  of  speedup  of  interest  regarding  the  RAV  expert 
system  are  the  number  of  rules  fired  per  second. 

6.2  Detailed  Design 

The  upper  bound  performance  analysis  presented  here  follows  closely  the  timing 
analysis  detailed  in  Appendix  C. 

Certain  assumptions  are  made  at  the  onset  of  this  analysis  to  present  an  ideal  com¬ 
puting  environment  for  the  RAV  expert  system  executing  on  the  proposed  parallel  ex¬ 
pert  system  shell  which  employs  mainly  production  parallelism.  First,  the  optimal  load 
balance  is  defined  as  an  even  distribution  of  the  workload  experienced  by  the  serial  algo¬ 
rithm  amongst  the  parallel  processors  available  to  the  parallel  program.  No  computational 
overhead  is  introduced  through  parallelization.  Second,  the  only  activity  other  than  com¬ 
putation  on  a  processor  that  produces  a  time  cost  is  interprocessor  communication.  No 
system  interface  overhead,  such  as  input  or  output  (I/O),  is  allowed  to  degrade  optimal 
performance. 
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From  Appendix  C,  the  time  required  to  complete  one  match-select-act  cycle  under 
the  proposed  parallel  design  is  defined  as  follows: 

( Equation  6.0) 

0([max pe([suvipe  aCEa  (match  filter  time)]  +  local  select  time)]  + 
select  compare/exchange  time  + 
act  broadcast  time  +  local  act  time  ) 

This  equation  states  that  the  cycle  time  consists  of  1)  the  maximum  time  spent  by 
one  of  the  processors  updating  its  local  Rete  network  and  selecting  a  candidate  rule  to 
fire  from  its  local  conflict  set,  plus  2)  the  time  for  the  processors  to  determine,  through 
a  gray-code  compare/exchange,  which  processor  has  the  best  candidate  rule  to  fire  ,  plus 
3)  the  time  to  broadcast  the  actions  specified  in  the  R1IS  of  the  rule  to  fire.  Under 
the  ideal  condition  assumptions  described  above,  the  time  spent  by  each  processor  to 
update  its  Rete  network,  select  from  its  conflict  set,  and  fire  the  best  rule’s  RHS  actions 
is  uniform  across  all  processors  (e.g.  perfect  load  balance).  Furthermore,  the  sum  of  the 
times  spent  processing  these  uniform  task  loads  equals  the  total  time  spent  processing  the 
entire  workload  serially.  Thus  Equation  6.0  simplifies  to  the  following,  with  N  being  the 
number  of  available  processors: 

( Equation  6.1 ) 


0((  total  serial  processing  time  /  N  )  + 
select  compare/exchange  time  + 
act  broadcast  time  ) 


6.3  Implementation 

The  task  of  determining  the  upper  bound  performance  for  the  parallel  design  pro¬ 
posed  in  this  investigation  now  becomes  that  of  acquiring  actual  and/or  expected  times  for 
the  total  RAV  expert  system  serial  processing  time,  the  average  select  compare/exchange 
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time,  and  the  typical  act  broadcast  time  experienced  on  the  iPSC/2  hardware.  Substitut¬ 
ing  these  time  values  into  the  equation  presented  in  the  previous  section  yields  the  upper 
bound  of  expected  performance  in  terms  of  the  total  processing  time  required  to  complete 
one  match-select-act  cycle. 

The  total  serial  processing  time,  determined  empirically  using  the  serial  CLIPS  design 
described  in  Chapter  V,  is  3.5  seconds  to  fire  73  rules.  The  select  compare/exchange  time 
and  act  broadcast  time  each  depend  on  the  data  rate  of  the  interprocessor  communication 
lines  on  the  iPSC/2  and  on  the  size  of  the  data  structure  sent  as  a  message.  Because  the 
minimum  message  data  structures  for  botli  types  of  messages  were  known,  prior  to  actual 
parallel  design  implementation,  the  processing  times  for  these  activities  are  also  determined 
empirically  on  the  iPSC/2  hardware. 

The  data  structure  passed  during  a  gray-code  compare/exchange  consists,  as  a  min¬ 
imum,  of  the  integer  ID  of  the  processor  passing  the  message  and  an  integer  value  repre¬ 
senting  the  firing  priority,  or  salience ,  of  its  candidate  rule.  A  total  of  d  communications 
of  such  a  structure  (where  d  is  the  dimension  of  the  hypercube)  is  required  to  ensure  the 
structure  representing  the  best  rule-to-fire  candidate  is  at  the  base  processor,  say  node  0. 
The  last  communication  required  in  the  select  step  is  the  broadcast  of  the  best  rule-to-fire 
structure  from  the  base  node  0  to  the  other  processors.  The  further  assumption  is  made 
that  a  broadcast  requires  the  same  amount  of  time  as  does  a  node-to-node  communica¬ 
tion.  A  total  of  d+ 1  communications  during  the  select  step  add  cost  to  the  total  program 
execution  time. 

The  data  structure  broadcasted  during  the  act  represents  the  RHS  actions  of  the  rule 
selected  for  firing.  These  RHS  actions  can  consist  of  any  number  of  fact  assertions,  fact 
retractions,  and  interface  actions  (such  as  I/O).  Again,  to  produce  ideal  computing  condi¬ 
tions  for  optimal  processing  speed,  only  fact  assertions  and  fact  retractions  are  considered 
in  this  analysis.  Another  simplification  is  that  all  of  a  selected  rule's  RHS  actions  are 
passed  in  a  single  data  structure  large  enough  to  contain  the  average  number  of  assertions 
and  retractions  specified  by  a  typical  rule  in  the  RAV  rule  set.  The  assertions  are  assumed 
sent  in  the  form  of  a  typical  RAV  fact  string  and  the  retractions  are  assumed  sent  in  the 
form  of  an  index  to  the  fact  in  WM  to  be  retracted. 
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Inspection  of  the  RAY  knowledge  base  suggests  that  the  average  rule’s  R II S  specifies 
approximately  two  fact  assertions  (1.46  average  observed)  and  about  two  fact  retractions 
( 1.18  average  observed ).  The  typical  length  of  a  fact  string  to  be  asserted  is  approximately 
231  characters,  which  represents  a  message  size  of  231  bytes  for  each  asserted  string. 
Integer  IDs  of  facts  to  be  retracted  add  8  bytes  each  to  the  message  size.  Under  the  above 
assumptions,  the  typical  single  message  broadcasted  during  the  act  step  is  about  470  bytes 
long. 

The  communication  times  required  for  passing  of  select  step  and  act  step  messages 
on  the  iPSC/2  are  determined  empirically  using  a  simple  ring  communication  program 
that  sends  messages  of  the  specified  size  around  the  nodes  of  the  hypercube,  configured 
as  a  ring.  Timing  data  are  collected  as  message  passing  proceeds.  Each  select  message 
communication  can  be  completed  in  0.00424  seconds.  The  single  act  message  can  be 
broadcasted  in  0.00776  seconds. 

Summing  the  times  derived  above,  the  upper  limit  on  the  time  required  for  the  pro¬ 
posed  parallel  design  to  process  the  73  rules  fired  in  the  RAY  benchmark  follows  Equation 
6.1: 


(  3.o  /  N  )  +  ((rf+1)  *  0.00424)  +  (0.00776)  seconds 

where  N  =  2d  is  the  number  of  parallel  processors  used.  Dividing  the  73  rules  fired 
by  the  result  of  this  equation  yields  the  upper  bound  performance,  in  rules  per  second,  for 
the  proposed  parallel  design. 

6.4  Summary 

The  theoretical  upper  bound  performance  metric  represents  the  maximum  poten¬ 
tial  RAY  expert  system  processing  speedup  realizable  using  the  proposed  design  under 
ideal  communication  and  load  balance  conditions.  Assumptions  made  to  simulate  ideal 
processing  conditions  include  the  following: 

•  Ideal  load  balance  suggests  even  distribution  of  the  serial  workload  amongst  available 
parallel  processors. 
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•  Neither  computational  overhead  due  to  parallelization  nor  system  interface  overhead 
due  to  I/O  is  considered. 

•  The  minimum-size  data  structures  are  assumed  passed  whenever  communication  is 
required,  and  both  node-to-node  and  broadcast  communication  times  are  uniform 
and  equal. 

•  All  of  the  actions  specified  in  a  rule's  RHS  can  be  contained  in  a  single  data  structure 
for  communication  purposes. 

Communication  times  required  for  passing  messages  of  sizes  typical  to  the  RAY  are  deter¬ 
mined  empirically.  The  equation  for  the  upper  limit  on  processing  performance  is 

(  3.5  /  N  )  +  ((d+  1)  *  0.00424)  +  (0.00770)  seconds 
where  N  =  2d  is  the  number  of  parallel  processors  used. 
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17/.  Step  3:  Current  Best  Performance 


7.1  System  Design 

The  performance  of  any  new  parallel  architecture  must  bo  compared  to  that  of  any 
existing  parallel  architecture  that  is  considered  state-of-the-art.  This  comparison  is  nec¬ 
essary  to  show  advancement  in  knowledge  for  fast  processing  of  a  particular  application. 
But  for  the  comparison  between  parallel  designs  to  be  valid,  the  designs  being  compared 
must  experience  similar  support  environments  (e.g.  hardware,  compilers,  languages).  Oth¬ 
erwise,  it  is  difficult  to  discern  whether  performance  differences  observed  are  due  to  the 
designs  or  rather  to  their  individual  support  environments. 

The  latest  architecture  applied  to  the  RAV  expert  system  was  designed  by  Shakley 
(37).  The  processing  speed  of  the  new  parallel  architecture  proposed  in  this  research  inves¬ 
tigation  is  compared  to  the  speed  achieved  by  Shakley ’s  architecture.  Shakley ’s  program 
design  was  implemented  in  Lisp  and  on  the  first  generation  Intel  Personal  Super  Computer 
(iPSC/l),  a  support  environment  different  from  that  of  the  new  architecture.  Because  ac¬ 
tual  reimplementation  of  the  Shakley  uesign  in  C  and  on  the  iPSC/2  is  neither  within  the 
scope  of  this  research  nor  desired,  a  theoretical  “reimplementation”  is  offered  instead.  That 
is,  the  likely  effects  on  the  Shakley  design’s  performance  due  to  any  upgrades  to  the  iPSC/2 
as  compared  to  the  iPSC/l  are  analyzed  theoretically.  The  theoretical  performance  results 
for  Shakley "s  design  are  then  used  for  comparison  to  the  new  parallel  design’s  performance 
results. 

7.2  Detailed  Design 

The  purpose  of  Shakley’s  research  investigation  was  to  analyze  and  explore  the  fea¬ 
sibility  of  translating  the  RAV’  expert  system  written  in  ART  for  the  TI  Explorer  into 
CCLISP  for  rehosting  onto  the  Intel  iPSC/l  Hypercube  (see  Appendix  A).  The  focus  of 
Shakley’s  study  was  search  parallelism  within  a  production  system  (37:10-12). 

In  his  parallel  RAV  expert  system  design,  Shakley  exploits  production  parallelism. 
Production  rules  are  equally  distributed  in  a  round-robin  fashion  across  available  proces- 
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sors  in  the  iPSC’/l  hypercube  multicomputer.  The  rules  constituting  a  processor's  local 
production  memory  (PM)  are  formed  into  a  linked- list  data  structure  (37:20). 

In  ART  representation,  each  RAY  fact  comprises  part  of  a  frame-like  struct uro  of 
facts,  called  a  schema.  Shakley  preserves  support  for  schemata  in  his  design  (37:41). 
Schemata  facilitate  indexing  into  facts  in  working  memory  (YVM),  thus  shortening  the 
time  required  to  find  and  check  the  value  of  a  given  fart  during  the  match  step.  Eacli 
processor  hosts  a  copy  of  the  entire  RAY  \VM. 

For  his  parallel  RAY  expert  system  design,  Shakley  implements  on  each  processor 
an  enhanced  version  of  a  serial  inference  engine  developed  by  Winston  and  Horn  (3s). 
Enhancements  to  the  Winston  and  Horn  serial  engine  include  support  for  schemata,  for 
salience  (priority)  selection  of  rules,  and  for  selection  from  a  conflict  set  (or  agenda)  of 
rules  (37:50).  The  ART  rules  and  schemata  are  translated  into  a  form  useable  by  the  now- 
inference  engine.  The  high-level  algorithm  employed  in  Shakley *s  design  is  illustrated  in 
Figure  7.1 . 

Shakley  organizes  the  iPSC/1  processors  into  a  spanning  tree  for  interprocessor  com¬ 
munication  (37:56).  A  spanning  tree  connection  pattern  is  another  name  for  the  gray-code 
definition  of  near-neighbor  processors  in  a  hypercube  network  (24:F-1).  This  spanning  tree 
connection  pattern  defines  the  parent-child  relationships  among  processors  referred  to  in 
Figure  7. 1 . 

Shakley  uses  a  test  suite  of  small,  prearranged  sets  of  facts  to  trigger  firings  of  sub.se' s 
of  rules  in  PM.  From  these  firings,  results  are  traced  to  confirm  correctness  of  operation 
and  to  yield  performance  metrics  on  the  speedup  achieved  by  the  parallel  RAY  expert 
system  design  (37:58). 

Shakley  acknowledges  two  shortcomings  in  his  design.  First,  the  round-robin  assign¬ 
ment  of  production  rules  to  processors  creates  a  load  imbalance.  Although  all  processors 
host  the  same  total  number  of  production  rules,  these  rules  are  of  varying  length  and  com¬ 
putational  complexity,  thus  causing  an  imbalance.  Second,  the  test  suites  of  prearranged 
facts  are  too  small  to  take  significant  advantage  of  parallelism.  The  small  test  sets  have  too 
little  span  of  effect  on  the  production  rules  in  PM,  thus  limiting  the  potential  production 
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Procedure  HYP  HR: 


1.  do  while  (termination  state  not  detected); 

2.  parallel  match 

-  each  processor  waits  to  receive  WMF,  change 

from  its  parent  (except  root  node) 

-  each  processor  sends  WME  change  to  its  child 

in  the  tree  (if  any) 

-  each  processor  adds  WML  change  to  its  local 

copy  of  the  WM 

-  each  processor  matches  the  rules  in  its  PM 

against  the  facts  in  its  WM 

3.  global  select 

-  each  processor  waits  to  receive  the  selected  rule 

from  its  child  (if  any) 

-  each  processor  adds  the  rule  received  from  its 

child  to  its  conflict  set 

-  each  processor  sends  selected  rule  to  its  parent 

on  tree  (not  root); 

-  root  processor  holds  production  to  fire  after  its 

select  is  done 

4.  global  act 

-  root  processor  sends  to  its  child  the  WME  change 
specified  by  selected  rule’s  RHS 

5.  end  do; 


Figure  7.1.  RAV  iF’SC/l  Flypercube  Design  (37:47) 
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parallelism. 


1.3  Implementation 

On  the  iPSC/1.  Shakley's  parallel  design  is  exercised  using  one  of  two  subsets  of  the 
R.\V  rule  base,  which  he  terms  "small"  and  “large"  rule  bases.  For  comparison  purposes, 
the  faster  performing  small  rule  base  configuration  is  analyzed.  Shakley's  expert  system, 
using  the  updated  Winston  and  Horn  inference  engine,  fires  an  average  of  1  rule  every  11 
seconds  in  a  serial  mode.  Speedups  are  realized  in  parallel  mode,  with  a  peak  performance 
of  about  0.5  rules  per  second  experienced  in  a  16-node  configuration  (see  Figure  7.2). 


f  igure  7.2.  iPSC/1  RAV  Performance  Results  (37) 

Uploaded  to  the  iPSC/2.  the  Shakley  design  will  experience  immediate  performance 
improvement  due  to  the  raw  processing  power  of  the  iPSC/2  chip  technology  compared  to 
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that  of  the  iPSC/1.  At  best,  a  processing  speedup  of  4  times  can  be  expected  due  to  use 
of  the  iPSC/2's  HO.'iMI  chip  versus  the  iPSC/l’s  80286.  A  speedti|)  can  also  be  expected  in 
communications  capability.  Even  though  the  iPSC/2‘s  peak  message  passing  rate  of  2.8 
Mbytes  per  second  ((21:1-11))  is  hardly  discernable  from  the  iPSC/l's  rate  of  2.5  Mbytes 
per  second  ((22:1-22)).  the  iPSC/2  communications  scheme  will  generate  some  speedup, 
because  the  passing  of  a  message  between  two  nodes  on  the  iPSC/2  does  not  interrupt  the 
processing  on  intermediate  nodes,  as  is  the  case  on  the  iPSC'/l. 

Experiencing  the  maximum  possible  beneficial  effect  of  the  processing  and  communi¬ 
cations  upgrades  in  the  iPSC/2.  the  Shakley  parallel  design's  performance  can  be  expected 
to  improve  to  8  rules  per  second,  at  best.  This  analysis  is  abandoned  at  this  point,  because 
the  further  possible  performance  improvements  in  the  Shakley  design  realizable  due  to  the 
iPSC/2's  broadcast  capability  and  to  a  hypothetical  reimplementation  of  the  design  in  C 
will  certainly  not  be  enough  to  bring  the  design's  performance  up  within  the  20-rules-per- 
second  range  of  the  serial  CLIPS  lower  bound  performance.  It  is  apparent  that  the  simple 
Winson  and  Horn  inference  engine,  even  executed  in  parallel,  cannot  compete,  in  terms  of 
processing  speed,  with  the  state-saving  Rete  algorithm  employed  in  CLIPS. 


17//.  Step  4-  Parallel  Rete  Performance 


This  chapter  describes  in  detail  the  proposed  hypercube  expert  system  shell,  called 
HyperCLIPS.  The  source  code  (Appendix  D),  programmer's  manual  (Appendix  E),  and 
user's  manual  (Appendix  F)  offer  more  detail  for  the  interested  reader. 

8.1  System  Desiejn 

The  proposed  design  implements  a  parallel  production  system  interpreter  which  uses 
the  Rete  match  algorithm.  A  C-based  version  of  the  serial  C-Language  Integrated  Produc¬ 
tion  System  (CLIPS)  interpreter  is  adapted  to  run  in  parallel  on  the  Intel  iPSC/2  (33:7*13). 
The  system  is  configured  to  take  advantage  of  production  parallelism,  enhanced  by  action 
parallelism.  The  goals  of  this  system  are  the  following: 

•  support  the  speedup  features  inherent  in  the  Rete  network  as  much  as  possible  (e.g. 
state  saving,  one-time  comparisons) 

•  minimize  detrimental  communication  overhead,  especially  in  the  match  step 

•  distribute  productions  in  such  a  way  that  the  workload  is  well  balanced  among  the 
available  processors 

•  assign  productions  that  are  expected  10  be  activated  at  the  same  time  to  different 
processors  to  enhance  parallelism 

8.2  Detailed  Desiejn 

Each  active  processor  supports  a  full  production  system  interpreter.  At  system  ini¬ 
tialization,  each  processor  is  parsed  a  subset  of  the  productions  in  the  RAV  knowledge  base 
from  which  to  build  a  local  Rete  network.  With  this  static  decomposition  approach,  no 
interprocessor  communication  is  required  during  the  match  step  because  all  of  the  nodes 
needed  to  process  state  updates  of  a  production  are  local  to  the  production’s  host  proces¬ 
sor.  Furthermore,  each  processor's  interpreter  performs  a  local  select  step,  picking  its  local 
candidate  production  for  overall  system  firing.  The  local  select  step  is  performed  without 
interprocessor  communicat ion. 
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The  first  intcrprocessor  communication  occurs  when  the  processors  must  compare  and 
exchange  their  locally  selected  productions  to  determine  which  of  these  productions  is  to 
he  selected  for  firing.  Using  the  gray-code  compare/exchange  paradigm  for  2d  processors 
connected  in  a  hypercube  (see  Appendix  B),  a  total  of  d  compare/exchanges  must  be 
performed  before  the  best  candidate  production  is  guaranteed  to  be  at  the  root  processor 
(2I:F-1). 

Once  the  root  processor  has  the  globally  selected  production,  the  working  memory 
element  (WME)  change  specified  by  the  RHS  of  that  production  is  broadcasted  to  all 
processors.  This  WME  change  is  the  input  to  each  of  the  processors  that  triggers  the 
subsequent  match  step.  The  entire  match-select-act  cycle  repeats  in  this  fashion  until  a 
termination  condition  is  detected  or  no  productions  are  matched.  Figure  8.1  illustrates 
the  high-level  algorithm  implemented  in  this  design.  Note  that  the  few  communications 
required  in  this  design  occur  only  in  the  select  and  act  steps. 

This  design  exploits  all  match-select-act  cycle  parallelism  discussed  in  Chapter  111. 
Parallelism  within  the  match  step  is  achieved  when  all  processors  update  their  Rete  sub¬ 
graphs  concurrently.  Select  step  parallelism  is  possible  because  each  processor  performs 
its  select  on  its  local  conflict  set  upon  completion  of  its  local  match,  thus  creating  the 
potential  for  multiple  processors  to  be  in  their  respective  local  select  steps  concurrently. 
The  processing  within  the  match  step  and  the  select  step  of  the  same  cycle  can  overlap 
when  the  local  match  on  one  processor  completes  and  triggers  the  start  of  the  local  se¬ 
lect  step  before  one  or  more  other  processors  complete  that  same  match  step  local’y.  Act 
step  parallelism  occurs  in  that,  when  the  globally  selected  production  is  fired,  the  working 
memory  element  change  is  broadcast  to  the  waiting  processors,  triggering  their  concurrent 
match  steps.  Overlap  of  the  act  step  of  one  cycle  and  the  match  step  of  the  next  cycle  is 
conceivable.  That  is,  the  WME  change  is  allowed,  by  design,  to  arrive  at  some  processors 
before  it  arrives  at  others,  although  “broadcast”  implies  the  change  arrives  at  all  processors 
simultaneously.  Whether  or  not  changes  arrive  simultaneously,  local  match  steps  begin  as 
soon  as  the  WME  change  is  received. 
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Procedure  HYPER-RETE: 


1.  do  while  (termination  state  not  detected); 

2.  parallel  match 

-  each  processor  receives  \YM  change  from  root  processor 

-  each  processor  updates  its  local  Rete  network 

3.  parallel  local  select 

-  each  processor  selects  a  production  from  its  local 

conflict  set 

4.  global  select 

-  processors  perform  gray-code  compare/exchange 

-  root  processor  holds  production  to  fire  when 

compare/exchange  done 

5.  broadcast  global  act 

-  root  processor  broadcasts  WM  change  specified 

by  selected  production’s  RHS 

6.  end  do; 

Figure  8.1.  RAV  Hypercube  Rete  Design 


8-3 


8.3  Implementation 

A  host  program  executing  on  the  front-end  host  processor  of  the  iPSC/2  provides 
the  user  interface  to  the  HyperCLIPS  shell.  The  host  prompts  the  user  for  the  desired 
cube  dimension,  the  application  knowledge  base,  and  the  desired  run  time  options  (see 
Appendix  F).  The  node  program,  which  is  executed  on  each  active  hypercube  processor, 
downloads  from  the  host  processor  the  entire  initial  working  memory  fact  set  and  its 
partition  of  the  total  production  rule  set.  Each  node  then  initializes  and  executes  its  local 
version  of  CLIPS. 

To  implement  HyperCLIPS,  two  adaptations  to  the  serial  CLIPS  shell  are  required: 
1 )  the  global  select  gray-code  compare/exchange  capability  and  2)  the  global  act  broadcast 
capability  (from  Figure  8.1).  For  both  of  these  communication  activities,  the  main  design 
challenge  is  the  choice  of  data  structure  to  pass  as  a  message. 

From  an  implementation-independent  perspective,  it  seems  that  the  typical  message 
consists  of  a  structure  representing  one  complete  rule.  For  the  select  compare/exchangc. 
the  structure  passed  by  a  processor  node  represents  the  top  rule  on  the  node's  local  conflict 
set.  For  the  act  broadcast,  the  structure  passed  represents  the  one  rule  selected  globally 
for  firing. 

Unfortunately,  the  data  structure  for  a  CLIPS  rule  is  a  multi-directional,  multipli- 
linked  list  of  multivariate  structures.  Because  the  many  pointers  employed  in  such  a 
structure  on  a  given  processor  have  meaning  only  in  the  context  of  that  processor's  local 
memory,  the  rule  structure  must  be  par  d  and  the  actual  structure  values  put  into  an 
array  to  be  sent  to  other  processors.  Once  received,  such  an  array  must  again  be  parsed, 
reassembled  into  CLIPS  rule  form,  and  processed. 

Parsing,  passing,  and  reassembing  of  rules  is  not  chosen  for  the  HyperCLIPS  imple¬ 
mentation  for  several  reasons.  First,  the  raw  complexity  of  parsing  and  reassembling  a 
rule  structure  is  prohibitive.  Second,  the  communication  time  required  to  send  such  a  po¬ 
tentially  large  message,  added  to  the  computation  overhead  time  required  for  parsing  and 
reassembling  the  rule  structure,  does  not  suggest  efficient  use  of  processing  time.  Third, 
a  rule  structure  in  CLIPS  is  tightly  entwined  in  the  local  Rote  network,  including  data 
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describing  the  rule’s  dependencies  among  other  rules  in  its  network.  Passing  of  a  complete 
rule,  then,  entails  passing  large  portions  of  a  processor’s  local  Rete  network,  including  data 
that  is  probably  unneeded  for  processing  at  another  processor  and  possibly  even  incorrect 
and  corrupting  if  processed  at  another  processor. 

The  chosen  select  compare/exchange  design  approach  is  to  pass  not  the  entire  rule, 
but  a  structure  consisting  only  of  the  priority,  or  salience,  of  the  top  rule  on  a  processor's 
conflict  set  and  the  ID  of  the  processor  holding  the  particular  candidate  rule  for  firing. 
After  the  compare/exchange  is  complete,  processor  0  holds  the  salience  of  the  rule  to  fire 
and  the  ID  of  the  processor  where  that  rule  resides.  Processor  0  then  broadcasts  this 
information  to  all  processors,  triggering  the  processor  holding  the  globally  selected  rule  to 
enter  a  master  processing  state  while  the  other  processors  enter  a  slave  processing  state 
for  the  upcoming  global  act  broadcast. 

The  act  broadcast  design  approach  is  for  the  master  processor  to  send  as  messages 
only  the  RHS  actions  of  the  globally  selected  rule,  rather  than  the  entire  rule  structure. 
Because  the  firing  of  any  rule  ultimately  results  in  either  no  action  on  \VM  or  a  series  of 
working  memory  element  additions  and/or  retractions  (see  Chapter  III),  each  fact  addition 
and  retraction  is  broadcast  as  it  is  about  to  be  processed  on  the  master  processor.  Conse¬ 
quently,  the  same  addition  or  retraction  is  received  and  processed  on  the  slave  processors, 
lagging  the  length  of  the  broadcast  time  behind  the  master’s  processing  state. 

A  fact  string  specified  for  assertion  in  the  selected  rule’s  RHS  must  be  built  by  parsing 
the  RHS’«  rT  TPS  fact  structure,  but  this  parsing  task  is  much  less  complex  than  parsing 
the  entire  rule  structure.  An  assert  message  consists  of  this  fact  string  and  a  tag  specifying 
an  assert  operation  is  required.  A  retraction  specification  need  only  consist  of  the  integer 
ID  of  the  fact  to  be  retracted,  because  all  processors  maintain  identical  copies  of  working 
memory  with  identical  working  memory  element  IDs.  A  retract  message  consists  of  this 
working  memory  element  ID  and  a  tag  specifying  a  retract  operation  is  to  be  performed. 

Unfortunately,  execution  of  the  RAV  expert  system  on  HyperCLIPS  resulted  in  slow¬ 
down  compared  to  the  serial  CLIPS  implementation.  A  two-node  configuration  produced 
an  average  of  only  12.6  rules  fired  per  second,  followed  by  5.1 1  rules  per  second  using  four 
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nodes  and  2.01  rules  per  second  using  eight  nodes.  These  results,  although  discouraging, 
are  in  keeping  with  the  timing  analysis  conclusions  presented  at  Appendix  C. 

Note  from  Appendix  C  that,  for  the  HvperCLlPS  design  to  produce  speedup,  the 
following  two  conditions  must  exist: 

1.  The  RHSs  of  all  production  rules  must  affect  many  condition  elements  (CEs),  or 
predicates,  in  the  LHSs  of  many  other  rules.  Ideally,  the  average  number  of  CEs 
affected  would  equal  the  number  of  processors  available. 

2.  The  production  rules  must  lend  themselves  to  fortuitous  assignments  to  unique  pro¬ 
cessors.  Specifically,  the  production  rules  containing  CEs  that  initiate  processing 
along  non-interacting  Rete  network  paths  should  be  assignable  to  unique  processors. 

Inspection  of  the  RAY  expert  system  execution  suggests  that  the  first  condition  is  not 
adequately  met.  The  average  number  of  facts  affected  per  rule  firing  is  observed  to  be  less 
than  three,  with  an  observed  range  between  1  and  13.  These  few  affected  facts  represent 
less  than  0.57c  of  a  W'M  that  averages  approximately  700  total  facts  during  execution. 
Furthermore,  the  facts  changed  due  to  rule  firings  are  in  turn  observed  to  affect  an  average 
of  just  over  four  rules  each,  with  an  observed  range  between  1  and  18.  These  four  rules 
represent  only  1.5%  of  the  273  rules  present  in  the  RAV  knowledge  base.  The  small  span 
of  effect  per  rule  firing  severely  limits  the  benefits  attainable  through  parallel  processing 
of  the  RAV  expert  system  using  production  parallelism. 

HvperCLlPS,  as  implemented  in  this  research  investigation,  has  no  mechanism  to 
take  full  advantage  of  the  second  condition  for  producing  speedup  described  above  and  in 
Appendix  C’.  To  the  extent  that  the  RAV  rule  base  lends  itself  to  fortuitous  assignment  of 
rules  to  unique  processors,  HyperCLIPS  leaves  the  burden  of  rule  assignment  to  the  user. 
No  algorithmic  process  is  currently  available  to  recognize  dependencies  and  relationships 
among  rules  in  an  expert  system’s  knowledge  base  and  to  use  dependency  data  to  drive 
near-optimal  assignment  of  rules  to  processors.  Consequently,  the  load  imbalance  intro¬ 
duced  by  the  user's  inability  to  assign  rules  to  processors  in  a  way  that  takes  advantage 
of  production  parallelism  further  limits  the  processing  speed  observed  for  the  RAV  expert 
system  using  HyperCLIPS. 
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8.4  Summary 

The  proposed  hypercube  expert  system  shell  is  called  HyperCLIPS.  HyperCLIPS 
employs  a  full  serial  CLIPS  interpretter  executing  on  each  available  processor  in  the  iPSC/2 
multicomputer.  Each  iPSC/2  processor  is  parsed  a  partition  of  an  application’s  total 
rule  set,  and  performs  a  normal  CLIPS  match  step  on  the  local  rule  set.  Then  a  gray- 
code  compare/exchange  of  each  processor’s  highest-salience  rule  (tagged  with  the  rule’s 
home  processor)  transfers  the  salience  of  the  rule  to  fire  to  the  root  node.  After  the  root 
broadcasts  the  salience  and  identification  information  for  selected  rule,  the  home  processor 
for  the  selected  rule  becomes  the  action  master  processor.  The  master  drives  the  other 
slave  processors  to  mimic  its  fact  assertions  and  retractions  as  the  selected  rule’s  R11S 
is  fired.  Execution  of  the  RAV  expert  system  using  HyperCLIPS  results  in  slow  down 
compared  to  the  serial  CLIPS  implementation.  The  slow  down  results  1)  from  the  small 
span  of  effect  produced  when  the  RHSs  of  typical  RAV  rules  are  fired  and  2)  from  the  lack 
of  a  mechanism  which  recognizes  RAV  rule  inter-dependency  data  and  uses  this  data  to 
drive  assignment  of  rules  to  processors. 
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IX.  Step  5:  Performance  Comparison  Findings 


In  this  chapter,  the  performance  results  produced  in  Chapters  V  through  VIII  are 
compared  to  the  RAY  expert  system  real-time  processing  requirement. 

Real-time  processing  of  the  RAY  is  defined  through  interview  with  AFWAL  project 
managers  (7).  The  current  ART  RAV  implementation  on  the  TI  Explorer  Lisp  machines 
executes  at  a  rate  of  15  to  30  rules  per  simulated  vehicle  second.  The  simulated  vehicle 
second  is  a  unit  derived  to  account  for  the  processing  delay  introduced  when  the  expert 
system  must  wait  for  the  vehicle  simulator  to  produce  needed  vehicle  control  parameters. 
One  simulated  vehicle  second  equals  approximately  2.5  actual  seconds.  Therefore,  real¬ 
time  processing  of  the  RAV'  entails  firing  rules  at  a  rate  roughly  between  37  and  75  rules 
per  second.  An  estimate  of  50  rules  per  second  is  plotted  as  the  real-time  RAY  performance 
requirement  in  Figure  9.1. 

The  plotted  performance  metrics  produced  during  analysis  of  lower  bound  perfor¬ 
mance,  upper  bound  performance,  and  parallel  Rete  performance  complete  the  remainder 
of  Figure  9.1.  Note  that  the  current  best  performance  is  omitted  from  Figure  9.1  because 
it  falls  entirely  below  the  lower  bound. 

The  lower  bound  performance  experienced  by  the  serial  CLIPS  design  is  impressive, 
although  it  does  fall  short  of  real-time.  CLIPS  performs  well  for  the  RAV  expert  system 
application  because  of  the  Rete  state-saving  feature  (from  Chapter  III).  CLIPS's  state¬ 
saving  takes  full  advantage  of  the  very  small  rate  of  change  of  working  memory  observed 
in  Chapter  VIII,  which  is  why  CLIPS  soundly  outperformed  the  Shakley  parallel  design  in 
terms  of  rules  fired  per  second. 

The  upper  bound  performance  metric  is  plotted  for  the  shown  number  of  processors 
(N  =  2d)  and  derived  from  the  equation  for  upper  bound  offered  in  Chapter  VL  The 
upper  bound  suggests  that  it  is  reasonable  to  expect  to  achieve  real-time  processing  of  the 
RAV  expert  system  on  the  iPSC/2  using  HyperCLIPS  under  ideal  conditions.  However, 
the  model  in  Chapter  VI  does  not  take  into  account  the  dependencies  between  rules  in  the 
RAV  knowledge  base  that  limit  the  ability  to  achieve  perfect  load  balance. 
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Legend:  1  -  lower  bound  performance  metric 

p  -  proposed  parallel  architecture  performance 
g  -  goal  performance 

u  -  upper  bound  on  proposed  architecture  performance 


Figure  9.1.  RAV  Expert  System  Performance  Spectrum 
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The  performance  of  HyperCLIPS  executing  the  RAV  expert  system  is  also  plotted 
in  Figure  9.1.  The  slow  down  experienced  by  HyperCLIPS  suggests,  at  best,  that  an 
algorithmic  mechanism  is  required  to  assign  input  rules  to  available  processors  in  a  way 
that  optimizes  potential  production  parallelism.  Worst  case,  the  slow  down  suggests  that 
the  RAV  expert  system  itself  does  not  exhibit  the  inter-rule  dependencies  necessary  to 
allow  a  significant  amount  of  production  parallelism. 


9-3 


A.  Conclusions  and  Recommendations 


10.1  Research  Conclusions 

Parallel  processing  is  a  promising  approach  to  achieving  real-time  processing  of  expert 
system  software.  The  keys  to  improving  parallel  processing  performance  are  reducing 
communications  overhead  and  balancing  task  load.  The  major  factor  in  both  of  these 
goals  is  the  proper  choice  of  a  problem  decomposition. 

This  research  goes  beyond  just  producing  a  new  parallel  architecture  design.  The 
performance  results  of  this  design  are  quantified  in  relation  to  the  lower  and  upper  perfor¬ 
mance  bounds,  the  current  state-of-the-art  design’s  performance,  and  the  required  real-time 
performance.  This  approach  not  only  adds  validity  to  the  performance  results  of  the  new 
design  but  also  exposes  the  level  of  maturity  the  RAV  expert  system  research  has  achieved 
as  a  consequence  of  this  design.  This  performance  quantification  methodology  serves  as  a 
template  for  other  researchers  performing  parallel  computer  architecture  design  as  applied 
to  any  application. 

10.2  Research  Recommendations 

The  findings  analyzed  in  Chapter  IX  suggest  that  research  into  parallel  processing 
of  the  RAY  expert  system  is  still  in  its  infancy.  The  speedups  realized  using  serial  CLIPS 
(Chapter  V)  support  the  continued  use  of  state- saving  algorithms,  such  as  Rete.  to  pro¬ 
cess  the  RAV.  But  the  characteristics  of  the  RAV  observed  in  Chapter  VIII  suggest  that 
the  system  lends  itself  to  very  limited  potential  speedup  due  to  production  parallelism. 
Therefore,  RAV  expert  system  research  is  perhaps  better  served  by  approaching  paral¬ 
lel  processing  from  the  standpoint  of  node  parallelism,  possibly  using  a  shared-memory 
multiprocessor  (see  Chapter  III). 

The  critical  component  missing  during  this  research  investigation  is  an  algorithmic 
mechanism  to  assign  production  rules  to  available  parallel  processor  nodes.  Such  a  mecha¬ 
nism  will  parse  rules,  recognize  dependencies  among  rules  that  promote  production  paral¬ 
lelism  during  rule  firings,  and  use  this  information  to  assign  rules  to  available  processors. 
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The  results  of  the  timing  analysis  in  Appendix  C  suggest  that  the  assignment  algorithm  is 


driven  by  the  span  of  effect  of  actions  in  the  RHSs  of  all  available  rules. 

An  “optimal"  assignment  algorithm  potentially  poses  an  NT-complete  problem,  an 
fast  execution  of  the  algorithm  may  require  parallel  processing  itself!  Hut  an  assignment  al¬ 
gorithm  is  critical  for  designs,  such  as  HyperCLIPS,  that  depend  on  production  parallelism 
to  realize  processing  speedup. 

Further  exercise  of  the  HyperCLIPS  expert  system  shell,  using  applications  more 
amenable  to  production  parallelism,  is  recommended.  HyperCLIPS  also  serves  as  a  possible 
tool  for  the  development  of  an  optimal  assignment  algorithm  described  above. 

10.3  Summary 

Parallel  processing  of  the  RAY  expert  system  is  still  in  its  infancy  as  an  area  of 
research.  Processing  of  the  RAY  expert  system  using  serial  CLIPS  executes  at  an  impres¬ 
sive  speed  due  to  the  state-saving  Rete  algorithm.  But  the  HyperCLIPS  implementation 
performs  poorly  for  the  RAY  application  due  to  the  limited  potential  production  paral¬ 
lelism  characteristic  of  the  RAY  and  due  to  the  need  for  a  rule  assignment  mechanism. 
Further  exercise  of  HyperCLIPS  using  applications  with  significant  potential  production 
parallelism  is  recommended. 
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Appendix  A.  Robotic  Air  l  chicle  Backtjround 


The  Robot  ic  Air  Vehicle  (RAY)  is  a  concept  under  exploration  by  the  Defense  Ad¬ 
vanced  Research  Projects  Agency  (DARPA)  and  the  Air  Force  Wright  Aeronautical  Lab¬ 
oratories  (AFWAL).  The  concept  is  to  create  an  unmanned  air  vehicle  capable  of  au¬ 
tonomous  operation.  The  RAY  must  be  able  to  perform  basic  piloting  skills  as  well  as 
passive  terrain  following,  terrain  avoidance,  obstacle  avoidance,  and  autonomous  naviga¬ 
tion.  The  mission  of  such  a  vehicle  would  consist  of  intelligent  reconnaissance  or  attack  of 
high  risk,  heavily  defended  targets.  A  contract  was  awarded  to  Texas  Instruments  Incor¬ 
porated  (TI)  in  September  19  So  to  develop  a  system  architect  tire  and  to  demonstrate  the 
feasibility  of  some  of  the  key  components  of  such  a  system  (2>).  TI  completed  its  system 
development  and  demonstration  in  June  1988  (12:1). 

The  final  system  architecture  developed  by  TI  is  shown  in  Figure  A.l.  The  RAY 
system  software  consists  of  six  top  level  modules:  the  Route  Planner,  the  Piloting  Expert 
System  (PFS).  the  Vehicle  Control  System  (YCS),  the  Spatial  Database,  the  Intelligent 
Vehicle  Workstation  (IYW),  and  the  Natural  Language  Menu  (NLMenu)  system  (12:G). 
This  architecture  includes  several  expert  subsystems  linked  to  a  central  controlling  agent, 
the  Piloting  Expert  System  (PES).  The  expert  «•••  bsystems  consist  of  the  following:  a 
Passive  Navigation  module  for  estimating  the  current  platform  location  on  a  digital  map 
and  for  generating  a  terrain  model  for  terrain  following;  an  Airspace  Expert  System  for 
three  dimensional  situation  awareness;  a  Route  Planner  for  generating  the  RAV  flight  pat h; 
a  Vehicle  Control  System  for  translating  high  level  flight  directives  into  stick  and  throttle 
manipulations.  The  PES  coordinates  the  activities  of  these  subsystems  and  is  the  final 
arbiter  of  RAV  responses  to  the  environment  (12:15). 

During  typical  system  operation,  mission  and  flight  data  is  entered  via  NLMenu 
or  Voice  and  sent  to  the  PES.  The  PES  sends  the  waypoint  coordinates,  including  target 
location,  to  the  Route  Planner,  which  calculates  a  path  through  t lie  waypoints.  As  t he  PES 
executes  the  route  plan,  it  responds  to  inputs  from  NLMenu  and  1VW  vehicle  simulation 
and  queries  the  Spatial  Database.  Spatial  Database  queries  activate  both  the  Passive 
Navigation  subsystem  and  the  Airspace  Expert  System.  These  inputs  are  used  to  pilot 


Figure  A.l.  RAV  System  Architecture  (1 1:7) 

the  vehicle  and  to  execute  the  mission  plan.  To  perform  complex  maneuvers  such  as  steep 
turns  and  rolls,  the  RES  issues  airspeed,  altitude  and  heading  commands  to  the  Y(’S. 

Figure  A. 2  shows  the  physical  placement  of  the  RAY  subsystems  with  respect  to  their 
hardware  requirements.  The  RAY  system  software  resides  in  a  hardware  configurat ion  of 
four  Texas  Instrument  Explorer  II  Lisp  Machines  and  one  Digital  Equipment  Corporation 
(DEC)  Micro Yax  II.  All  machines  are  linked  together  via  a  Local  Area  Network  (LAN). 
A  message  scheme,  known  as  Post  Office,  facilitates  communication  among  the  machines 
and  their  resident  software  subsystems.  The  Explorers  support  ail  of  the  AI  software  and 
the  vehicle  control  and  simulation  software.  1  he  Micro  Yax  II  performs  the  mathematic 
computations  required  for  con  entionaJ  software  systems  (12:')). 

Knowdedge  bases  for  these  expert  systems  were  developed  using  conventional  knowl¬ 
edge  engineering  techniques.  The  methodology  was  based  on  the  approach  used  to  train 
fighter  pilots.  The  knowledge  acquisition  process  followed  a  series  of  qualification  training 
sessions  and  evaluation  check-rides.  Esing  this  technique,  the  system  capabilities  increased 


Figure  A. 2.  HAV  System  Configuration  (11:10) 
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in  an  ordered  fashion.  The  knowledge  bases  became  hierarchical  as  new  concepts  were  built 
on  old  skills.  Each  level  of  competency  was  validated  before  moving  to  the  next  training 
level  (12:28). 

The  knowledge  representation  used  to  develop  the  knowledge  bases  for  the  expert 
systems  is  a  combination  of  TI  Dallas  Inference  Engine  (TII)IE).  Automated  Reasoning 
Tool  (ART),  and/or  Lisp  representations.  These  three  tools  are  used  to  represent  the 
knowledge  bases  at  three  different  levels  of  abstraction. 

TIDIE  was  developed  for  the  RAN’  project  to  provide  a  readable,  high-level  knowl¬ 
edge  representation  for  piloting  rules.  The  TIDIE  representation  consists  of  OBJECTS, 
representing  the  aircraft  state  variables;  NEEDS,  representing  the  goal  to  be  met:  and 
PLAN'S,  representing  how  the  goal  is  to  be  met  (12:12).  NEED  and  PLAN  macros  expand 
these  high  level  structures  into  lower-level  ART  components. 

Inference  Corporation's  ART  is  a  commercial  artificial  intelligence  shell.  It  provides 
an  inference  engine  for  rule-based  reasoning.  ART  also  uses  the  Rote  algorithm  for  pattern 
matching.  The  ART  representation  consists  of  typical  expert  system  RULES  and  facts 
organized  into  frame  structures,  called  SCHEMAS  (12:13).  A  Lisp-based  version  of  ARE 
is  employed  in  the  RAN’  architecture. 

All  three  levels  of  knowledge  representation  are  compatible.  TIDIE  is  implemented  in 
ART  rules  and  Lisp  runctions,  thus  allowing  free  intermixing  of  TIDIE  and  ART  constructs 
in  a  single  knowledge  base  source  file.  Lisp  was  used  directly  for  low  level  command  modules 
like  the  NT’S  (12:12). 

The  TI  architecture  has  demonstrated  the  feasibility  and  practicality  of  a  Robotic 
Air  Vehicle  (RAN’).  The  entire  ensemble  of  subsystems  has  proven  the  effectiveness  of 
distributed  cooperating  expert  systems.  The  RAN’  provides  a  performance  benchmark 
for  a  near  real-time  control  system  (12:35).  Unfortunately,  current  hardware  support  for 
symbo'ic  computing  such  as  that  used  by  the  PES  is  not  adequate  to  permit  real-time 
control  of  a  vehicle  by  an  expert  system  ( 12:33). 

In  the  final  report,  'II  recommended  three  main  follow-on  research  directions  to 
be  pursued:  first,  investigate  how  the  RAN’  prototype  would  handle  a  more  high  fidelity 
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simulation;  second,  investigate  real-time  expert  systems,  distributed  expert  systems,  and 
maintenance  and  verification  of  expert  systems;  and,  third,  extend  the  piloting  capabilities 
currently  employed  by  the  RAN’  prototype  (12:35). 
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Appendix  B.  Parallf.1  Processing  Architect  tires 


Computer  architectures  may  be  classified  by  many  metiiods,  or  taxonomies.  Flynn’s 
Taxonomy  is  based  on  the  concepts  of  instruction  stream  and  data  stream  (8:1901).  An 
instruction  stream  is  a  sequence  of  instructions  performed  by  a  computer.  A  data  stream 
is  a  sequence  of  data  used  to  execute  an  instruction  stream. 

The  Single-Instruction  stream.  Single  Data  stream  (SISD)  category  includes  most 
serial  computers.  The  Single-Instruction  stream,  Multiple  Data  stream  (S1MD)  category 
includes  processor  arrays.  The  Multi  pie- Inst  ruction  stream,  Multiple  Dal  a  stream  (  M IM  I) ) 
category  contains  most  multiprocessor  systems.  Finally,  no  computers  in  common  use 
today  belong  in  the  Multiple-Instruction  stream,  Single  Data  stream  (MISD)  category 
(31:16).  This  leaves  SIMD  and  MIM1)  as  the  two  main  categories  of  parallel  processors. 

As  implied  by  the  label,  an  SIMD  parallel  system  has  multiple  processors  operating 
the  same  instruction  synchronously  on  separate  data  streams.  Examples  of  SIMD  archi¬ 
tectures  are  the  ILI.1AC  IV  and  Connection  machines.  The  M1MD  parallel  system  has 
multiple  processors  capable  of  operating  on  multiple  data  streams  with  different  opera¬ 
tions  asynchronously.  Examples  of  MIMD  architectures  are  the  Butterfly  and  thp  iPSC 
hypercube  (20). 

There  are  many  possible  processor  organizations,  or  interconnection  method n.  for  par¬ 
allel  architectures.  The  following  are  examples  of  commonly  used  interconnection  methods 
(31:25-29): 


•  Mesh  network  -  Processors  are  arranged  into  a  q-dimensional  lattice.  Communication 
is  possible  only  bi  .ween  neighboring  nodes,  thus  interior  nodes  can  communicate  with 
2q  other  processors  (see  Figure  B.I). 

•  Pyramid  network  -  Processors  in  a  pyramid  network  of  size  p  form  a  complete  4- 
ary  rooted  tree  of  height  logA p  augmented  with  additional  interprocessor  links  that 
allow  processors  in  every  tree  level  to  form  a  two-dimensional  mesh  network  (see 
Figure  B.2). 
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•  Shuffle- Exchange  Network  -  This  network  consists  of  n  =  2k  processors  and  two  kinds 
of  connections:  shuffle  and  exchange.  Exchange  connections  Jink  processors  whose 
numbers  differ  in  their  least  significant  bit.  The  shuffle  connection  links  processor  i 
with  processor  2i  modulo  (n-1)  (see  Figure  B.3). 

•  Butterfly  Network  -  (k+l)2fc  processors  are  divided  intok+1  rows,  or  ranks,  contain¬ 
ing  n  =  2k  processors  each.  Each  processor  has  four  connections  to  other  processors 
(see  Figure  B.4). 

•  Hypcrcube  ( Cube-Connected )  Network  -  A  cube-connected  network  is  a  butterfly  with 
its  columns  collapsed  into  single  processors.  The  network  consists  of  n  =  ‘2k  processors 
forming  a  k-dimensional  hypercube.  Two  processors  are  adjacent  if  their  labels  differ 
in  exactly  one  bit  position  (see  Figure  B.5). 


Figure  B.l.  2-D  Mesh  Network  (34:26) 

Another  important  feature  of  a  parallel  architecture  is  the  memory  organization 
employed.  Most  reported  parallel  SIMD  architectures  assume  a  shared  global  memory 
among  all  processors  (34:30).  MIMD  architectures,  however,  can  be  further  classified  as 
multiprocessors  or  multicomputers  based  on  the  memory  organization. 


An  MIMD  multiprocessor  is  characterized  by  shared  memory  among  the  processors. 
Iri  a  tightly  coupled  multiprocessor,  these  processors  work  through  a  central  switching 
mechanism  to  reach  the  shared  global  memory.  A  loosely  coupled  multiprocessor  is  also 


Figure  D.2.  Size  16  Pyramid  Network  (3-4:27) 


Figure  B.3.  8-Node  Shuffle- Exchange  Network  (34:27) 
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Figure  B.4.  32- Node  Butterfly  Network  (34:28) 


Figure  B.3.  4-1)  Hypercube  Network  (34:2k) 


characterized  by  a  shared  address  space,  but  this  shared  address  space  is  formed  by  com¬ 
bining  the  local  memories  of  the  processors  (34:35-38). 


An  MIME)  multicomputer  has  no  shared  global  memory.  Instead,  each  processor 
has  its  own  local  memory.  Process  cooperation  occurs  either  through  message  passing  or 
through  memory  shared  between  pairs  of  processors  (34:41). 

Two  parallel  architectures  are  of  interest  in  this  research  investigation.  The  first  is 
a  network  of  four  Texas  Instruments  (TI)  Explorer  II  Lisp  machines.  The  second  is  the 
second  generation  Intel  Personal  Supercomputer  (iPSC/2).  Each  is  an  M1MD  architecture. 

The  first  host  architecture  of  the  RAV  expert  system  consists  of  a  network  of  four 
TI  Explorer  II  Lisp  machines.  They  form  a  loosely  coupled  system.  Each  system  has  its 
own  local  memory  and  a  common  bus  structure  connects  the  systems.  All  systems  share 
a  central  file  server.  Each  of  the  four  systems  are  very  powerful  Lisp  processors. 

The  second  architecture  of  interest  is  the  Intel  iPSC/2  hypercube.  This  system  can 
be  configured  with  up  to  128  processor  elements  (PE).  The  iPSC/2  is  a  multicomputer 
made  up  of  Intel  8038G  processors.  The  flexibility  of  this  system,  along  with  its  ready 
accessibility,  make  it  a  useful  tool  for  this  study. 

Examples  of  the  previously  discussed  architectures  are  given  in  Figure  B.6.  Only  the 
TI  Explorer  Lisp  machine  and  the  iPSC/2  are  evaluated  as  part  of  this  research  investiga¬ 
tion  due  to  availability. 
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SIMD  Processor  Arrays: 

ILLIAC  IV 

Connection  Machine 

Burroughs  PEPE 

IBM  GF-11 

ICL/DAP 

MIMD  Multiprocessors: 
(tightly  coupled ) 

C.mmp  (Carnegie-Mellon) 

Encore  Multimax 

Sequent  Balance  8000 

MIMD  Multiprocessors: 
(loosely  coupled ) 

Cm*  (Carnegie-Mellon) 

BBN  Butterfly 

MIMD  Multicomputers: 

Intel  iPSC 

Ametek  S/14 

NCUBE/10 

Appendix  C.  Tuning  Analysis  of  RA\  Expert  System 


One  key  to  the  amount  of  processing  required  to  perform  a  state  update  in  a  Rete 
network  during  the  match  step  is  the  number  of  condition  elements  (CEs)  present  near 
the  top  of  the  network.  Each  CE  may  be  considered  an  initiator  of  a  path  or  set  of  paths 
into  the  Rete  network.  As  a  network  path(s)  initiator,  a  CE  defines  a  fixed  number  of 
productions  that  can  potentially  be  affected  by  the  information  flow  down  that  path(s).  A 
CE's  span  of  effect  is  important  to  this  timing  analysis  because  match  speed-up  available 
from  production  parallelism  is  proportional  to  the  average  number  of  affected  productions 
( 1-1: 51 ). 

When  a  working  memory  element  (WME)  change  is  input  at  the  root  node  of  a  Rete 
network,  the  information  flow  resulting  from  the  processing  of  that  WME  change  passes 
through  one  or  more  CEs  and  down  through  some  number  of  nodes  in  the  network.  Cali 
the  time  from  the  initial  receipt  of  a  WME  change  at  the  root  node  to  the  termination  of 
information  flow  caused  by  that  WME  change  the  match  filter  time.  Call  the  time  required 
by  the  interpreter  to  select  a  production  from  the  updated  conflict  set  the  local  select  time. 
Similarly,  call  the  time  required  by  the  interpreter  to  evaluate  the  RHS  of  the  selected  rule 
and  to  send  it  to  the  root  node  in  the  network  the  local  act  tune. 

The  term  “local”  implies  that  the  specified  activity  is  performed  using  the  data  (i.e. 
conflict  set.  Rete  network)  that  are  saved  in  the  memory  that  is  local  to  the  processor 
running  the  interpreter.  Although  this  definition  is  assumed  for  serial  processors,  it  is 
useful  later  in  the  context  of  multicomputers. 

Consider  the  very  simple  example  case  in  which  the  WME  change  input  at  the  root 
node  of  the  Rete  network  matches  only  one  CE  at  a  subsequent  t-const  node.  A  production 
system  interpreter  implemented  on  a  serial  processor  will  complete  a  match-select-act  cycle 
in  the  amount  of  time  defined  by  the  following  (in  order-of  notation): 


O  (  match  filter  time  +  local  select  ti.ae  +  local  act  time  ) 


In  the  opposite  extreme  case,  assume  a  WME  change  input  at  the  root  node  matches 
all  of  the  CEs  at  the  subsequent  t-const  nodes.  A  serial  production  system  interpreter  will 
complete  one  match-select-act  cycle  in  the  following  time: 

0(  [stmia,i<:  (match  filter  time)]  +  local  select  time  +  local  act  time  ) 

Now  cons  er  the  production  system  implemented  on  a  multicomputer  configured  as 
in  the  design  presented  in  this  research  investigation.  Call  the  time  required  by  processors 
to  perform  the  gray-code  compare/exchange  of  productions  in  the  select  step  the  scUrt 
com  jxire  /exchange  lime.  Call  the  time  required  to  broadcast  to  al'  other  processors  the 
production  selected  for  firing  the  act  brrxidcast  time. 

Returning  to  the  simple  example  case  in  which  the  WME  change  input  matches  only 
one  CE,  the  parallel  production  system  interpreter  will  complete  one  cycle  in  the  following 
time: 

0([maxpE  (match  filter  time  +  local  select  time)]  + 
select  compare/exchange  time  + 
act  broadcast  time  4-  local  act  time  ) 

This  equation  states  that  the  processor,  or  processing  element  (PE),  that  requires 
the  most  time  to  update  its  local  Rete  subnetwork  and  to  select  a  production  from  its  local 
conflict  set  will  make  the  processors  that  have  completed  these  steps  wait  before  entering 
the  select  compare/exchange  step.  Of  course,  the  processor  that  hosts  the  Rete  subnetwork 
that  holds  the  single  affected  CE  will  experience  the  maximum  match  filter  time.  Note 
that  the  local  act  time  does  not  appear  as  maxpp  ( local  act  time)  because  all  processors 
are  assumed  to  enter  the  local  act  step  synchronously  as  a  result  of  the  act  broadcast  and 
because  the  local  act  step  will  require  the  same  amount  of  time  on  each  processor. 

omparing  the  results  of  serial  versus  parallel  processing  of  the  single-afTected-CE 
rase,  both  systems  experience  the  same  match  filter  time  and  the  same  local  act  time. 
The  parallel  system  could  experience  a  slight  savings  in  local  select  time  because  the 
affected  Rete  subnetwork  will  always  have  a  conflict  set  smaller  than  or  equal  in  size  to 
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that  of  the  whole  Rete  network  on  the  serial  system.  Consequently,  the  processor  ^  i 
the  parallel  system  that  hosts  the  affected  Rete  subnetwork  could  potentially  perform  its 
local  select  faster  than  could  the  processor  in  the  serial  design.  But  this  potential  local 
select  time  savings  is  sure  to  be  overcome  by  the  communication  cost  incurred  during  select 
compare/exchange  and  act  broadcast.  This  finding  suggests  that  the  multicomputer  design 
is  not  as  fast  as  the  serial  processor  design  when  processing  WME  changes  that  affect  very 
few  CEs. 

Considering  the  example  case  in  which  the  WME  change  input  matches  all  CEs,  the 
parallel  production  system  interpreter  will  complete  one  cycle  in  the  following  time: 

0(  {max  pt;[sumFE  sC  Es  (match  filter  time  +  local  select  time)]}  + 
select  compare/exchange  time  + 
act  broadcast  time  +  local  act  time  ) 

Again,  some  processor  (PE)  will  require  more  time  than  the  others  to  process  all  of 
the  Rete  subnetwork  updates  required  by  its  many  CE  matches  and  to  select  a  production 
from  its  local  conflict  set  The  other  processors  will  wait  for  that  processor  to  finish  before 
entering  the  select  compare/exchange  step. 

Comparing  the  results  of  serial  versus  parallel  processing  of  the  all-CE  -affected  case, 
both  systems  experience  the  same  local  act  time.  The  parallel  system  could  experience 
a  substantial  savings  in  match  filter  time  if  th*  workloads  distributed  to  the  available 
processors  are  of  equal  computational  complexity.  Assuming  this  good  load  balance,  the 
sum  of  the  match  filter  time  saved  and  the  local  select  time  saved  in  parallel  design  can 
reasonably  be  expected  to  be  greater  than  the  sum  of  the  select  compare/exchange  time 
incurred  and  the  art  broadcast  time  incurred.  This  finding  suggests  that  the  multicomputer 
design  can  be  expected  to  perform  faster  than  the  serial  processor  design  when  processing 
WME  changes  that  affect  many  uniquely  assignable  CEs. 

The  conclusion  reached  by  this  analysis  is  that  the  performance  of  th°  parallel  pro 
duction  system  design  proposed  in  this  research  investigation  depends  upon  the  knowledge 


base  of  the  application  of  interest.  For  this  design  to  perform  well,  the  application  pro¬ 
duction  system  must  exhibit  the  following  qualities: 

1.  T'he  RHSs  of  all  production  rules  must,  on  the  average,  affect  many  CFs.  Ideally, 
the  average  number  of  CFs  affected  would  equal  the  number  of  processors  available. 

2.  The  production  rules  must  lend  themselves  to  fortuitous  assignments  to  unique  pro¬ 
cessors.  Specifically,  the  production  rules  containing  CFs  that  initiate  concurrently 
filterable  Rete  network  paths  should  be  assignable  to  unique  processors. 


Appendix  I).  ParnlUl  RA\  Expert  System  Program 


The  appendix  presents  the  actual  C  language  code  implementing  the  HyperCLIPS 
shell.  Note  that  only  the  original  CLIPS  routines  adapted  for  HyperCLIPS  are  included 
here.  Routines  not  shown  in  this  appendix  are  used  in  HyperCLIPS  in  their  original  CLIPS 
for  m . 

/•it********************************************************************** 

DATE:  11/17/89 
VERSION:  1.0 

TITLE:  CLIPS  loader  for  iPSC/2  Hypercube 
FILENAME:  HOST.C 
AUTHOR:  Capt  William  A.  Harding 
COORDINATOR:  R.  Norris 

PROJECT:  Hypercube  Expert  System  Shell  -  Application  of 
Production  Parallelism  (Thesis) 

OPERATING  SYSTEM:  UNIX  System  V/386  R3.0  (for  iPSC/2) 

LANGUAGE:  C 

FILE  PROCESSING:  Compile  and  link  with  chost.def,  stdio.h 
CONTENTS: 

main  -  executive  module 

power  -  exponentiation  subroutine 

FUNCTION: 

This  program  prompts  the  user  for  the  dimension  of  the  cube  to  be 
used  for  the  expert  system  shell.  It  then  loads  processors  ith 
the  knowledge  base  represented  in  user-entered  files.  The  host 
then  prompts  for  the  run  limit  (number  of  rules  to  fire  before 
stepping)  and  the  watch  option  desired  (to  display  rules,  facts, 

1)1 


fp  =  fopen  ("watch. out", "w") ; 

fprintf  (fp , "\nCLIPS  EXPERT  SYSTEM  SHELL  FOR  THE  iPSC/2  HYPERCUBE\n\n" 
); 

fprintf  (fp,"  Written  by  Capt  William  A.  Harding"); 

/tit:***#*********************************#*******#************************/ 

/*  Get  cube  dimension  from  user  and  send  to  nodes  */ 

/************************************************************************/ 

printf  ("  Enter  cube  dimension  (1-5):  ")  ; 
scanf  ("'/.d"  ,ftdim)  ; 

csend  (DIM. TYPE,  ftdim,  sizeof (dim) ,  ALL. NODES,  NODE.PID) ; 
fprintf  (fp,"\n\n**  Dimension  */,d\n"  .dim) ; 

/I***************,,.,,***************************************************/ 

/*  WM  loop:  Prompt  for  the  names  of  files  holding  facts  and  templates  */ 
/*  and  send  these  so  each  node  can  initialize  its  copy  of  working  memory*/ 
/*  The  loop  waits  for  the  nodes  to  load  a  file  before  requesting  another*/ 

/**************.,*****************************************************•***/ 

dummy  =1;  /*  dummy  -  no  meaningful  value  */ 

num.nodes  =  power(2 ,dim) ; 

printf  ("\n  Enter  fact  file  names  for  all  processors:  "); 
printf  ("\n  ->  ") ; 
scanf  ("7,s"  ,  sname)  ; 

while  ((infile  =  fopen(sname,  "r"))  !=  NULL)  { 
fclo8e( infile) ; 

for  (i  =  0-  i  <  num.nodes;  i**)  { 


or  all  affected  during  the  run) .  Run  times  are  then  collected  from 
the  nodes,  as  well  as  any  "watch"  results,  and  these  are  displayed. 

******#***«*******************************************:*******************/ 

# include  "/usr/ipsc/lib/chost . def " 

#include  <stdio.h> 

•define  NODE.PID  1  /*  Node  process  id 

•define  HOST.PID  1  /*  host  process  id 

•define  ALL_N0DES  -1  /*  all  active  nodes  in  the  cube 

•define  ALL.PIDS  -1  /*  all  active  processes  in  the  cube 

•define  D0NE_TYPE  0  /*  type  of  done  message 

•define  DIM.TYPE  10  /*  type  of  dimension  message 

•define  FILE. TYPE  20  /*  type  of  filename  message 

•define  LIM.TYPE  30  /+  type  of  run  limit  message 

•define  ITEM.TYPE  40  /*  type  of  watch  item  message 

•define  ACT.TYPE  50  /*  type  of  watch  action  value  message 

•define  FIRE.TYPE  60  /*  type  of  rules  fired  message 

•define  TIME.TYPE  70  /*  type  of  time  message 

•define  REPORT.TYPE  80  /*  type  of  report  signal  message 

•define  GO.TYPE  90  /*  type  of  synch  start  message 

•define  TIME.SIZE  (sizeof (long) )  /*  size  of  time  message  in  bytes 

main  () 


dim , 

/* 

dimansion  of  cube 

*/ 

num.nodes , 

/* 

number  of  nodes  in  current  cube  dim 

*/ 

i. 

/* 

iteration  counter  (thru  processors) 

*/ 

dummy , 

1* 

dummy  variable  (done  messages) 

*/ 

run. limit , 

/* 

run  limit  for  desired  run 

*/ 

watch.action.value , 

/* 

f *ns  watch  option  ON  or  OFF 

*/ 

*/ 

*/ 

*/ 

*/ 

*/ 

* / 
*/ 
*/ 
*/ 
*/ 
*/ 
*/ 
*/ 
*/ 
*/ 


check.f irings , 

/* 

start , 

/* 

rules.f ired; 

/* 

float 

longest.time , 

/* 

rules.per.sec ; 

/* 

long 

run.time ; 

/* 

char 

fname[l5] , 

/* 

sname [15] , 

/* 

watch_item[15]  ; 

/* 

FILE 

*fp, 

/* 

♦inf ile ; 

/* 

checks  consistency  of  rules  fired  */ 

signifies  1st  node’s  responses  coming*/ 
number  of  rules  fired  during  run  */ 

maximum  run  time  of  all  nodes  */ 

run’s  average  rules  fired  per  second  */ 

run  time  returned  from  a  node  */ 

loadfile  name  */ 

setfile  name  */ 

item  to  display  using  watch  option  */ 

file  pointer  */ 

input  file  pointer  */ 


/*  Load  the  cube  */ 


setpid  (HC  .T.PID)  ; 

load  ("node",  ALL. NODES,  NODE.PID)  ; 

/I***********************************************************************/ 

/*  Print  welcome  message  */ 

/*************************************************-, . .  «****************/ 

printf  ("\n  CLIPS  EXPERT  SYSTEM  SHELL  FOR  THE  iPSC/2  HYPERCUBE\n\n" 


); 

printf  (" 


Written  by  Capt  William  A.  Harding\n\n") ; 


fp  *  fopen  ("watch. out" ,"w") ; 

fprintf  (f p , "\nCLIPS  EXPERT  SYSTEM  SHELL  FOR  THE  iPSC/2  HYPERCUBE\n\n" 
); 

fprintf  (fp,"  Written  by  Capt  William  A.  Harding"); 

/ ******************************* ***************************************** / 
/*  Get  cube  dimension  from  user  and  send  to  nodes  */ 

/************************************************************************/ 

printf  ("  Enter  cube  dimension  (1-5):  "); 
scanf  ("'/.d"  ,4dim) ; 

csend  (DIM. TYPE,  &dim,  sizeof(dim),  ALL. NODES,  NODE.PID) ; 
fprintf  (fp,"\n\n**  Dimension  ’/.d\n"  .dim)  ; 

/************************************************************************/ 

/*  WM  loop:  Prompt  for  the  names  of  files  holding  facts  and  templates  */ 
/*  and  send  these  so  each  node  can  initialize  its  copy  of  working  memory*/ 
/*  The  loop  waits  for  the  nodes  to  load  a  file  before  requesting  another*/ 

/ *************** **********************  *********************** ************ / 

dummy  =1;  /*  dummy  -  no  meaningful  value  */ 

num.nodes  =  power(2 ,dim) ; 

printf  ("\n  Enter  fact  file  names  for  all  processors:  "); 
printf  ("\n  •>  ")  ; 
scanf  ("’/,s",  sname)  ; 

while  ((infile  =  fopen(sname,  "r"))  !=  NULL)  { 
fclose(inf ile) ; 

for  (i  =  0;  i  <  num.nodes;  i++)  { 


csend(FILE_TYPE ,  sname,  sizeof (sname) ,  i,  NODE.PID); 
crecv(DuNE_TYPE ,  ftdummy ,  sizeof (dummy) ) ; 


> 

printf  ("  ->  "); 
scanf  C'/.s",  sname); 

> 

f close(inf ile)  ; 

for  (i  =  0;  i  <  num.nodes;  i++)  { 

csend(FILE_TYPE,  sname,  sizeof (sname) ,  i,  NODE.PID);  > 

/ ******^*********************************************************** ****** / 
/*  PM  loop:  Prompt  for  each  node’s  rule  file  names  and  send  these  to  */ 

/*  the  nodes  so  they  can  access  the  files  to  load  their  rule  bases.  */ 

/*  The  loop  waits  for  the  node  to  load  a  file  before  requesting  another.*/ 
/************************************************************************/ 

for  (i  =  0;  i  <  num.nodes;  i++)  { 

printf  ("\n  Enter  rule  file  names  for  processor  ’/.d  :  ",  i)  ; 
printf  ("\n  ->  ") ; 
scanf  ("7,s",  fname) ; 

while  ((infile  =  fopen(fname,  "r"))  !=  NULL)  { 
f close(inf ile)  ; 

csend(FILE_TYPE,  fname,  sizeof (fname) ,  i,  NODE.PID); 
crecv(DONE_TYPE ,  ftdummy,  sizeof (dummy) ) ; 
printf  ("  ->  "); 
scanf  ('7.s",  fname); 

> 

fclose(inf ile) ; 

csend(FILE._TYPE,  fname,  sizeof  (fname) ,  i,  NODE.FID) ; 

> 


/.ft***********************************************************************/ 

/*  Prompt  for  run  options  (run.limit,  vatch.item,  watch_action_value)  */ 
/*  and  send  these  to  all  nodes.  This  serves  as  the  nodes’  "GO"  signal.  */ 

/ft**********************************************************************/ 

printf  ("\n  Enter  run  limit  (-1  for  no  limit):  "); 
scanf  ("'/.d",  ftrun.limit); 

csend(LIM_TYPE ,  ftrun.limit,  sizeof (run.limit) ,  ALL.NODES,  NODE.PID) ; 

printf  ("\n  Enter  watch  item  (lower  case):  "); 
scanf  ("’/.s",  watch.item); 

csend(ITEM.TYPE,  vatch.item,  sizeof (watch.item) ,  ALL_N0D£S,  NODE.PID); 

printf  ("\n  Enter  watch_action_value  (l-ON,  O-OFF) :  "); 
scanf  ("'/.d"  ,  &watch_action_value) ; 
printf  ("\n\n  Executing. . .  \n\n"); 

csend(ACT_TYPE ,  fcwatch_action_value ,  sizeof (watch.action.value)  , 
ALL.NODES,  NODE.PID)  ; 

for  (i  =  0;  i  <  num.nodes;  i  +  +  )  { 

crecv (DONE.TYPE ,  fcdummy ,  sizeof (dummy)) ;} 

csend(GO_TYPE ,  dummy,  sizeof (dummy) ,  ALL.NODES,  NODE.PID) ; 

/****#***♦**********************#****************************************/ 
/*  Get  run  data  from  nodes,  compute  and  display  the  results.  */ 

/************************************************«***********************/ 
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longest.time  =  0.00; 

start  =1;  / *  true  -  first  time  thru  */ 

for  (i  *  0;  i  <  num.nodes;  i++)  { 

csend(R£PORT_TYPE ,  dummy,  sizeof (dummy ) ,  i,  NODE.PID) ; 

crecv(FIRE_TYPE,  ftrules.fired,  sizeof (rules.fired) )  ; 
if  (start  ==  1)  { 

check_f irings  =  rules.fired; 

start  =  0;  /*  false  -  first  time  only  */ 

} 

else  { 

if  (check_f irings  !=  rules.fired) 

printf  ("\n**  ERROR  ->  rules  fired  discrepancy  **\n") ; 

} 

crecv (TIHE.TYPE ,  trun.time,  TIME. SIZE) ; 
if  (longest.time  <  (f loat)run_time/1000 .00) 
longest.time  =  (float)run_time/1000.00; 


rules.fired--;  /*  subtract  out  setup  rule  firing  */ 
rules.per.sec  =  (f loat)rules_f ired  /  longest.time; 

printf  ("\n**  Run  Completed  after  */,d  rule  f irings. \n",  rules.fired) 
fprintf  (fp,"\n**  Run  Completed  after  */,d  rule  f  irings.  \n", 


rules.fired) ; 


printf  ("\n**  Total  Run  Time  =  */,0.5f  seconds.  \n",  longest.time) ; 
fprintf  (fp,"\n**  Total  Run  Time  =  */,0.5f  seconds. \n",  longest.time) ; 


printf  ("\n**  Average  for  Run  =  '/,0.5f  rules/second . \n" , 

rules_per_sec) ; 

fprintf  (fp,"\n**  Average  for  Run  =  y.0.5f  rules/second .  \n" , 

rules_per_sec) ; 


f close  (fp) ; 


killcube  (ALL.NODES ,  ALL_PIDS) ; 

} 


power(base,  exp) 
int  base, exp; 

{ 


int  answer; 

for  (answer  =  1;  exp  >  0;  exp--) 
answer  =  answer  *  base; 
return(answer) ; 

> 


I  >8 


DATE:  11/17/89 
VERSION:  1.0 


TITLE:  CLIPS  Node  for  iPSC/2  Hypercube 
FILENAME:  NODE.C 
AUTHOR:  Capt  William  A.  Harding 
COORDINATOR:  R.  Norris 

PROJECT:  Hypercube  Expert  System  Shell  -  Applying 
Production  Parallelism  (Thesis) 

OPERATING  SYSTEM:  NX/2  Node  executive  (for  iPSC/2) 

LANGUAGE:  C 

FILE  PROCESSING:  Compile  and  link  with  host.c,  cnode.def,  stdio.h, 
msgs.h,  and  clips. h 

CONTENTS: 

main  -  executive  module 

power  -  exponentiation  subroutine 

FUNCTION: 


This  program  receives  the  dimension  of  the  cube  to  be  used  for  the 
expert  system  shell  and  determines  which  nodes  will  remain  active. 

It  then  initializes  the  CLIPS  expert  system  shell  and  loads  the 
knowledge  base  from  user-entered  files  (at  the  host).  Each  node 
then  receives  the  run  limit  (number  of  rules  to  fire  before 
stopping)  and  the  watch  option  desired  (to  display  rules,  facts, 
etc.  affected  during  the  run).  Run  times  are  sent  to  the  host, 
as  wel1  as  any  "watch"  results  (saved  as  a  file  at  the  host). 

******* ******************************************************************/ 

•include  "/usr/ipsc/lib/cnode . def " 


l)f* 


•include  <stdio.h> 


•include  "clips. h" 
•include  "msgs.h" 


•define 

HOST.NID  myhostO 

/* 

host  node  id 

*/ 

•define 

HOST.PID 

1 

/* 

host  process  id 

*/ 

•def ine 

NODE.PID 

1 

/* 

Node  process  id 

*/ 

•define 

ALL. NODES 

-1 

/* 

All  nodes’  ids 

*/ 

•define 

DONE.TYPE 

0 

/* 

type  of  done  message 

*/ 

•def ine 

DIM. TYPE 

10 

/* 

type  of  dimension  message 

*/ 

•def ine 

FILE.TYPE 

20 

/* 

type  of  filename  message 

*/ 

•def ine 

LIM.TYPE 

30 

/* 

type  of  run  limit  message 

*/ 

•def ine 

ITEM, TYPE 

40 

/* 

type  of  watch  item  message 

*/ 

•def ine 

ACT.TYPE 

50 

/* 

type  of  watch  action  value  message 

*/ 

•def ine 

FIRE.TYPE 

60 

/* 

type  of  rules  fired  message 

*/ 

•define 

TIME. TYPE 

70 

/* 

type  of  time  message 

*/ 

•define 

REPORT.TYPE 

80 

/* 

type  of  report  signal  message 

*/ 

•define 

GO .TYPE 

90 

/* 

type  of  synch  start  message 

*/ 

•define 

TIME.SIZE 

(sizeof (long) )  /*  size  of  time  message 

*/ 

. . . 

/*  Global  Variables 

(seen  also 

in  files:  engine. c  and  factmngr.c) 

*/ 

. . . . . 

int 

dim, 

/* 

cube  dimension 

*/ 

my.node , 

/* 

my  node  numbe1- 

*/ 

my.pid , 

/* 

my  node  process  id 

*/ 

num.nodes , 

/* 

number  of  nodes  in  current  cube  dim 

*/ 

RHS.type , 

/* 

type  of  action  message 

*/ 

RUNNING; 

/* 

global  flag  -  master  rule  firing  node*/ 

DIO 


long  start.time;  /*  start  time  of  run  */ 


RHS_msg  my_RHS_buf ;  /*  global  msg  -rule  firing  mechanism  */ 


main() 


/*  Local  Variables 

*/ 

{ 

int 

dummy , 

/* 

dummy  variable  (done  messages) 

*/ 

run.limit , 

/  + 

run  limit  for  desired  run 

*/ 

wat ch_ act ion_ value , 

/* 

turns  watch  option  ON  or  OFF 

*/ 

msg.type, 

/* 

message  type  buffer 

*/ 

rules.f ired; 

/* 

number  of  rules  fired  during  run 

*/ 

long 

/*  start.time, 

start  time  of  run 

*/ 

run.time ; 

/* 

total  time  of  run 

*/ 

char 

fname[15]  , 

/ 

dfile  name 

*/ 

sname [15] , 

/  + 

setfile  name 

*/ 

watch_item[15] ; 

/* 

item  to  display  using  watch  option 

*/ 

FILE 

*f  1 ; 

/* 

file  pointer 

*/ 

/*  each  node  define  its  node  number  and  pid  */ 


U  1 1 


my  .node  =  mynodeO; 
my.pid  *  mypid( ) ; 


/************************************************************************/ 
/*  receive  dimension  from  host,  compute  number  of  nodes,  &  start  CLIPS  */ 
/************************************************************************/ 

crecv  (DIM.TYPE,  &dim,  sizeof (dim) ) ; 
num.nodes  =  power (2 , dim) ; 
if  (my.node  >-  num.nodes)  exit  (0); 

init. clips ()  ; 

RUNNING  =  FALSE;  /*  no  nodes  firing  rules  yet  */ 

/**»**«** »***************************************************************/ 

/*  WM  loon:  Receive  from  the  host  names  of  files  holding  facts/templates*/ 
/*  and  read  these  to  initialize  the  local  copy  of  working  memory  (facts)*/ 
/*  Continue  this  loop  for  all  fact  file  names  sent  from  the  host.  */ 

/***************************, v  ******************************  ■;*************/ 

/*  set_conserve("on") ;  on  -  pprule  info  not  kept  */ 
dummy  =1;  /*  dummy  -  no  meaningful  value  */ 

crecv (FILE.TYPE,  sname ,  sizeof (sname)) ; 
while  ((fl  =  fopen(sname,  "r"))  !=  NULL)  { 
fcloBe(fl) ; 
load.rules (sname)  ; 

csend(DONE.TYPE,  dummy,  sizeof (dummy ) ,  HOST. NIP,  my.pid); 
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crecv(FILE_TYPE ,  sname ,  sizeof (sname) ) ; 

> 

fcluse(fl) ; 

/«***«*******************************************************************/ 

/*  PM  loop:  Receive  rule  file  names  from  the  host  and  access  the  files  */ 
/*  at  the  host  to  load  the  local  rule  base.  Continue  this  loop  for  */ 
/*  all  rule  file  names  sent  from  the  host.  */ 

/************************************************************************/ 

crecv (FILE.TYPE ,  fname,  s izeof (f name) ) ; 
while  ( Cf 1  =  fopen(fname,  "r"))  !=  NULL)  { 
f close(f 1)  ; 
load_rules (fname) ; 

csend(DONE_TYPE ,  dummy,  sizeof (dummy ) ,  HQST.NID,  my.pid)  ; 
crecv(FILE_TYPE,  fname,  sizeof (fname) ) ; 

> 

f close(fl) ; 

/************************************************************************/ 

/*  Receive  run  options  from  host  (run.limit,  watch.item,  */ 

/*  watch.action.value).  Then  reset  CLIPS  environment  and  start  the  run.*/ 

/ft.**********************************************************************/ 

creov(LIM_TYPE ,  ftrun.lirait,  sizeof (run.limit) ) ; 
crecv(ITEM.TYPE,  watch.item,  sizecf (watch.item) ) ; 
crecv(ACT_TYPE,  ftwatch.action.value ,  sizeof (watch.action.value) )  ; 

set_watch(watch.item,  watch.action.value) ; 


D  n 


reset.clipsO  ; 


csend(DONE_TYPE ,  dummy,  sizeof (dummy) ,  HOST.NID,  my.pid); 
crecv(GO_TYPE ,  idummy,  sizeof (dummy) ) ; 

rules.fired  =  run(run.limit) ; 
run. time  =  mclockO  -  start.time; 

/*  displayfactsO ;  *  enable  if  test  desired  */ 

/************************************************************************/ 
/*  Send  run  data  for  this  node  to  the  host  */ 

/************************************************************************/ 

crecv (REPORT_TYPE ,  & dummy ,  sizeof (dummy) ) ; 

csend(FIRE_TYPE ,  fcrules.f ired ,  sizeof (rules.fired) ,  HOST.NID,  my.pid) ; 
csend(TIME_TYPE ,  ftrun.time,  TIME.SIZE,  HOST.NID,  my.pid); 

} 

/*******************♦****************************************************/ 
/*  Subroutine  Power  */ 

/**********************♦***»*************** ****+*****************+*******/ 

power  (base,  exp) 
int  base,  exp; 

{ 

int  answer; 

for  (answer  =  1;  exp  >  0;  exp--) 
answer  =  answer  *  base; 
retum(answer)  ; 

> 


DU 


~l 


/*  USRFUNCS :  The  function  which  informs  CLIPS  of  any  user  */ 
/*  defined  functions.  In  the  default  case,  there  are  no  */ 
/*  user  defined  functions.  To  define  functions,  either  */ 
/*  this  function  must  be  replaced  by  a  function  with  the  »/ 
/*  same  name  within  this  file,  or  this  function  can  be  */ 
/*  deleted  from  this  file  and  included  in  another  file.  */ 
/*  User  defined  functions  may  bs  included  in  this  file  or  */ 
/*  other  files .  */ 
/*  Example  of  redefined  usrfuncs:  */ 
/*  usrfuncsO  */ 
/♦  {  */ 
/*  def ine.f unction("funi" , ’ i ’ ,funl , "funl") ;  */ 
/*  def ine_function("other" , ’f ’ .other , "other") ;  */ 
/-  >  */ 


usrfuncsO 

{ 

> 
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/*  HyperCLIPS  Version  1.0  11/17/89  */ 


/*******************************************************/ 

/+  NOTE!  :  the  following  routines  are  compiled  within  */ 
/*  "engine. c".  Only  new  routines  added  for  HyperCLIPS  */ 
/*  or  original  CLIPS  routines  adapted  for  HyperCLIPS  */ 
/*  are  shown  here.  Other  routines  not  shown  here  must  */ 
/*  still  be  compiled  in  their  original  CLIPS  form  */ 

/*  within  "engine. c"  */ 

/ ******************************************************* / 

/***  ^ ******************************************** ***m*** / 
/*  "C"  Language  Integrated  Production  System  */ 

/*  ENGINE  MODULE  */ 

/*******************************************************/ 

•include  <stdio.h> 


•include  "clips. h" 
•include  "engine. h" 
•include  "msgs.h" 


•define 

ALL. NODES 

-1 

/* 

All 

nodes’  ids 

*/ 

•def ine 

BUFFER.SIZE 

(sizeof (gray.msg) ) 

/* 

size 

of 

message 

buf f «r 

*/ 

•def ine 

RHS.SIZE 

(sizeof (RHS.msg) ) 

/* 

size 

of 

RHS  msg 

buffer 

*/ 

/****************************************/ 
/*  GLOBAL  INTERNAL  FUNCTION  DEFINITIONS  */ 
/****************************************/ 
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int 

rur ( ) ; 

int 

add.activationC) ; 

int 

remove.all. activations  0 

int 

set.agenda.countO  ; 

int 

set.activations.watchC) ; 

int 

get.activations.watchO  ; 

int 

purge.agendaO  ; 

int 

cl ear .rule .from. agenda () 

int 

pnnt.activationO  ; 

struct  activation 

♦get.next.activationO  ; 

int 

set.execution.error () ; 

int 

get.execution.error ( ) ; 

int 

get.change.agendaO  ; 

int 

set.change.agendaO  ; 

int 

global.selectO  ; 

int 

get.partner.nodeO  ; 

int 

slave.runO ; 

struct  fact 

*ptr_to() ; 

/***************************************/ 
/*  LOCAL  INTERNAL  VARIABLE  DEFINITIONS  +/ 
/ *********************  .  ***  ******** t***f 


static  struct  activation 
static  long  int 
static  int 
static  int 
static  char 

Btatic  struct  exec.func 
static  int 


♦AGENDA  =  NULL; 

AGENDA. COUNT  =  0; 
vatch.activations  =  OFF; 
EXECUTION.ERROR  =  FALSE; 
♦currentrule  *  NULL; 
♦exec.list  *  NULL; 
change.agenda  =  FALSE; 
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static  int 


static  int 
static  int 
gray.msg 

int 


executing  =  FALSE; 
NULL.SALIENCE  =  -10001; 
i  =  0; 
my.msg.buf , 
part.msg.buf ; 
rules.fired  =  0; 


/*  GLOBAL  EXTERNAL  VARIABLE  DEFINITIONS  */ 


extern  RHS.msg  my.RHS.buf ; 


extern  struct  fbind  *gbl_lhs_ binds ; 

extern  struct  fbind  *gbl_rhs_binds ; 

extern  int  dim,  /*  dimension  of  cube  */ 

my. node,  /*  node  number  */ 

my.pid,  /*  node  pid  */ 

num.nodes,  /*  number  of  nodes  in  cube  */ 

RHS.type,  /*  type  of  action  message  */ 

RUNNING;  /*  node  running  flag  */ 

extern  long  start.time;  /*  start  time  hack  */ 


/**************************************************************/ 

/*  RUN:  Begins  execution  of  rules.  If  run  limit  is  less  than  */ 
/*  zero,  then  rules  will  be  executed  until  all  nodes ’ agendas*/ 
/*  are  empty.  If  run  limit  is  greater  than  zero,  then  rules*/ 
/*  will  be  executed  until  either  all  agendas  are  empty,  the  */ 
/*  run  limit  has  been  reached,  or  a  rule  execution  error  */ 
/*  has  occured.  Returns  the  number  of  rules  fired.  */ 
/**************************************************************/ 
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run(run.limit) 


int  run.limit ; 

{ 

struct  test  *commands; 
struct  fbind  *local_vars; 
struct  activation  *rule_to_f ire ; 
char  print.space  [20] ; 
struct  values  result ; 
struct  exec.func  *exec_ptr; 

RHS.msg  done.RHS.buf ; 

done_RHS_buf .type  =  DONE; 
strcpyCdone.RHS.bui.fact.str,  EOS) ; 
done.RHS.buf . index  -  -1; 

/*  Fire  rules  until  all  agendas  empty,  the  run  limit  */ 
/*  has  been  reached,  or  a  rule  execution  error  occurs.  */ 

EXECUTION .ERROR  =  FALSE; 
executing  =  TRUE; 

global.select ( )  ; 

while  ( (my.msg.buf . salience.of .rule  !=  NULL.SALIENCE-'  44 
(run.limit  !=  0)  4S 
(EXECUTION .ERROR  ==  FALSE)) 

{ 
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rules.f ired++ ; 

if  (run.limit  >  0)  {  run_limit — ;  } 
if  (my.msg.buf . chosen.node  ==  my.node)  \ 

RUNNING  =  TRUE; 

RHS.type  =  my.node  +  num.nodes; 

/*  Bookkeeping  and  Tracing.  +/ 
/*==========================*/ 

currentrule  =  AGENDA->rule ; 

if  (get.rules.watchO  ==  ON) 

{ 

sprintf (print.space , "FIRE  ‘/,4d  "  .rules.f  ired) ; 
cl.print ("wtrace" ,print_space) ; 
cl_print("wtrace" .currentrule) ; 
cl.print ("wtrace" :  "); 

print.f act .basis ("wtrace" , AGENDA- >bas is- >binds) ; 
cl.print ("wtrace" ,"\n") ; 

> 

if  (get.crsv.trace.watchO  ==  ON) 

sprintf  (print.space ,  "F  ’/,-4d  "  .rules.f  ired)  ; 

cl_print("vcrsv_tr" .print.space) ; 
cl_print("wcrsv_tr" .currentrule) ; 
cl .print ("wcrsv.tr" ,"\n") ; 
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> 


/*  Execute  the  rule’s  right  hand  side  actions.  */ 

/*=============================================,/ 


change.agenda  =  TRUE; 
rule.to.fire  =  AGENDA; 
commands  =  AGENDA->actions ; 
local.vars  =  AGENDA->basis->binds ; 

AGENDA  =  AGENDA->next ; 

gbl.lhs.binds  =  local.vars; 
gbl.rhs.binds  =  NULL; 

commands  =  commands->arg_list ; 

while  ((commands  !=  NULL)  &&  (EXECUTION.ERROR  ==  FALSE)) 

r 

i 

gener4  c_ compute (commands ,4result) ; 
commands  =  commands->next_arg ; 

> 

commands  =  NULL; 


/*========================================*/ 

/*  Return  the  agenda  node  to  free  memory.  */ 

/,========================================,/ 


ratumbinds (local.vars) ; 

rtn. struct (f link , rule.to.f ire->basis); 

rtn.struct (act i vat  ion ,rule_to_f ire) ; 
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/*X  =  =  =  =  =  =  =  =  =  =  =  =  =  =  =  =  =  =  =  =  =  =  ==  =  =  =  =  =  =  =:  =  =  =  =  =  =  =  =  */ 

/*  Tell  other  nodes  my  RHS  firing  is  done  -/ 

/*========================================,/ 

csend(RHS_type ,  Stdone.RHS.buf  ,  RHS.SIZE,  ALL.NODES,  my.pid) 
RUNNING  =  FALSE; 

> 

else  { 

slave.runO  ; 

> 


/*  Remove  retracted  facts,  ephemeral  symbols,  */ 
/*  variable  bindings,  and  temporary  segments.  */ 


rmv_old_facts()  ; 
rem_eph. symbols () ; 
f lush.bind.list  () ; 
f lush.aegments ( ) ; 


/*  Execute  exec  list  after  performing  actions.  */ 
/*»»«.=  =======================*==============*/ 

exec.ptr  =  exec.list; 


while  (exec.ptr  !=  NULL) 

{ 
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fl 


Ocxoc.ptr  :  ip)  () ; 
exec.ptr  =  exec.ptr->next ; 

> 

global .select') ; 

} 

executing  =  FALSE; 

EXECUTION. ERROR  =  FALSE; 
return(rules_f ired) ; 

} 


/*  Giobal.Select  :  Sends  salience  of  top  rule  on  local  */ 

/*  AGENDA  for  comparison.  Receives  salience  of  RHS  */ 

/«  to  compute  and  id  of  node  holding  selected  rule  ♦/ 

global. select ( ) 

{ 

int  partner. node , 
count ; 

if  (rules. fired  ==  1)  {  start. time  =  mclockO;  >  /*  start  time  hack  */ 

if  (AGENDA  ==  NULL)  { 

my.msg.buf . salience.of .rule  =  NULL.SALIENCE ; 

> 

else  { 

By.msg.buf  . sal ience.of .rule  =  AGENDA->salience; 

> 
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my .msg.buf . chosen.node  =  my.node; 


/*  Do  gray-code  compare-exchange  down  to  node  0  */ 

for  (count  =  0;  count  <  dim;  count**  )  { 

partner. node  =  get_partner_node(my_node ,  (power (2 , count) ) ) 

if  (my .node  >  partner .node) 

r 

{ 

csend  ''my  .node  ,  4my.msg.buf, 

BUFFER. SIZE, 
partner. node ,  my.pid); 

> 

else  { 

crecv (pai tner.node  ,  S part .msg.buf , 

BUFFER. SIZE)  ; 

if  (part.msg.buf . salience. of .rule  > 

my.msg.buf . salience. of .rule) 

{  my.msg.buf . salience. of .rule  = 

part.msg.buf.salience.of. rul e; 
my.msg.buf . chosen.node  =  part.msg.buf . chosen. node 

} 

> 

} 


/*  broadcast  overall  best  msg  to  all  nodes  »/ 
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if  (my.node  ( -•  0)  { 

crecv(0,  4my.msg.buf,  BUFFER. SIZE)  ; 

> 

else  { 

csend(0,  4my.msg.buf,  BUFFER. SIZE,  ALL.NQDES,  ir.y.pid)  ; 

} 


> 


/****%  i 


(***«****«*******•*** 


/«  Get.Partner.Nnde :  Returns  gray-code  partner  node  for  message  passing  */ 


/  *  * « < 


'*****»*»***»»*/ 


ge* .partner.node  (this.node,  xor. value) 
int  this.node, 
xor. value ; 

{ 

int  binary.code; 

bir.ary.code  =  (gray (this.node) )  *  (gray (xor.value) )  ; 
return(ginv(binary.code) ) ; 

} 


/*  SLAVE.RUN:  Receives  assertion/retraction  to  perform  from  chosen  node 


/ 

/ 


i  *  +  *  *  / 


slave_run( ) 

{ 


struct  fact  *ptr; 
int  ay.RHS.type; 
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my.RHS.type  =  my.msg.buf . chosen.node  +  num.nodes; 
crecv(my_RHS_type .  ftmy.RHS.buf,  RHS.SIZL) ; 


while  (my.RHS.buf . type  !=  DONE}  { 
switch  (my.RKS.baf .type)  { 

case  ASSERT  :  a~sert (my.RHS.buf . fact.str) ; 

it  (rules.fired  ==  1)  { 

print f  ("I'/.d :  \"'/,s\"\n"  ,  ray. node , 
my.RHS.buf  .fact.sf)  ; 

> 

break ; 

case  RETRACT  :if  ((pf  =  ptr. to (my.RHS.buf . index) )  !=  NULL) 
{  retract.f act  (ptr) ;  } 
break ; 

default  :  printf ("ERROR  in  switch  Btmt  type\n");  break; 

> 

crecv(my_RHS_type ,  ftmy.RHS.buf,  RHS.SIZE'!  ; 


/*  PTR.TO:  Returns  pointer  to  the  fact  with  the  specified  index  */ 

struct  fact  *ptr_to(index) 

FACT.ID  index; 

{ 

struct  fact  *ptr; 

ptr  =  get.next.f act(NULL) ; 
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while  (ptr  !=  NULL) 


{ 

if  (ptr->ID  ==  index) 

{  return(ptr)  ;  } 
ptr  =  get_next_f act (ptr) ; 

> 

return(NULL) ;  /*  fact  not  found  */ 

> 

/*  HyperCLIPS  Version  1.0  11/17/89  */ 

/#******************************************************/ 

/*  NOTE!  :  the  following  routines  are  compilad  within  */ 
/*  "f actmngr . c" ,  Only  new  routines  added  for  */ 

/*  HyperCLIPS  or  original  CLIPS  routines  adapted  for  */ 
/*  HyperCLIPS  are  shown  here.  Other  routines  not  shown*/ 
/*  here  must  still  be  compiled  in  their  original  CLIPS  */ 
/*  form  within  "factmngr.c"  */ 

/*******************************************************/ 

/***•***************************************************/ 
/*  "C"  Language  Integrated  Production  System  */ 

/*  FACT  MANAGER  MODULE  */ 

/*******************************************************/ 

•include  <stdio.h> 

•include  <string.h> 

•include  "setup. h" 

•include  "msgs.h" 
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I 


♦include  "clips. h" 
♦include  "scanner. h' 


1 

1 

♦define  ALL.NODES  -1  /*  All  nodes’  ids 

♦define  RHS.SIZE  (sizeof (RHS.msg) )  /*  size  of  RHS  msg  buffer 

*/ 

*/ 

I 

1 

/****************************************/ 

/*  GLOBAL  INTERNAL  FUNCTION  DEFINITIONS  */ 

/********************«*******************/ 

1 

1 

struct  fact  *get_el(); 

struct  fact  *add_fact(); 

char  *build_str() ; 

■ 

1 

■ 

/♦♦♦♦a*******************************#***/ 

/*  GLOBAL  EXTERNAL  FUNCTION  DEFINITIONS  */ 

/****************************************/ 

■ 

1 

1 

extern  struct  draw  *add_symbol ( ) ; 

extern  struct  element  *fast_gv(); 

extern  struct  pat.node  *network_pointer () ; 

extern  char  *symbol_string() ; 

1 

1 

/***************************************/ 

/*  LOCAL  INTERNAL  VARIABLE  DEFINITIONS  */ 

/•***************** ,*********************/ 

1 

1 

1 

static  struct  fact  *garbage_f acts  =  NULL; 

static  struct  fhash  *f act.hashtable [SIZE.FACT.HASH] ; 

static  int  watch_f acts ; 
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static  struct  fact 


*f actlist ; 
static  struct  fact  *last_fact; 
static  long  int  ID; 

static  int  change.facts  =  FALSE; 

/****************************************/ 

/*  GLOBAL  EXTERNAL  VARIABLE  DEFINITIONS  */ 
/************★***************************/ 


extern  int 


extern  RHS.msg 
extern  struct  fbind 
extern  struct  fbind 
extern  struct  funtab 


my.node , 
my.pid , 
RHS.type , 
RUNNING; 
my.RHS.buf ; 
*gbl_lhs_binds ; 
*gbl_rhs_binds ; 
♦PTR.GET.VAR; 


/*************************************************************/ 
/*  RETRACT.FACT :  Retracts  a  fact  from  the  fact  list  given  a  */ 
/*  pointer  to  the  fact.  */ 

/*************************************************************/ 
retract _f act (fact _ptr) 
struct  fact  *fact_ptr; 

{ 

FACT.ID  fact.num; 
struct  fact  *temp_ptr; 
struct  match  *match_list; 
char  print.space [20] ; 
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/*==*===================================================*/ 

/*  Check  to  see  if  the  fact  has  already  been  retracted.  */ 

/*===X==*==============================================i;*/ 

if  (RUNNING  ==  TRUE) 

{ 

temp_ptr  =  fact.ptr; 
my_RHS_buf . type  =  RETRACT; 
my_RHS_buf . index  =  temp_ptr->ID ; 
strncpy (my.RHS.buf . fact _str, EOS , 

strlen(my_RHS_buf . f act.str) ) ;/ *not  used*/ 
isend(RHS_type ,  Stmy_RHS_buf ,  RHS.SIZE,  ALL.NODES,  my_pid) ; 

> 

temp.ptr  =  garbage.! acts ; 
while  (temp.ptr  !=  NULL) 

{ 

if  (temp.ptr  ==  fact.ptr) 

{  return(O);  > 
temp.ptr  =  temp_ptr->next ; 

> 


/♦=========================================*/ 

/*  Show  retraction  if  facts  being  watched.  */ 
/♦==™™™===========™====== . */ 

fact.num  *  f act_ptr->ID ; 

if  (watch.facts  ==  ON) 
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{ 

cl_print("wtrace" ,"<==  "); 
shov.f act ( "vtrace" ,f act.ptr) ; 
cl.print ("wtrace" , "\n") ; 

> 

if  (get.crsv.trace.watchO  ==  ON) 

{ 

cl.print ("wcrsv.tr" , "R  "); 

sprintf (print.space,  "7,-51d  "  ,f act_ptr->ID) ; 

cl .print ("w crsv.tr " .print .space) ; 
print. element("wcrsv_tr" ,&(fact_ptr->atoms [0] )) ; 
cl .print ("wcrsv.tr" , "\n") ; 

> 

change.facts  =  TRUE; 

/*=====================================*/ 

/*  Delete  the  fact  from  the  fact  list.  */ 
/*=====================================*/ 

del.hash.f act (f act.ptr) ; 

/*  Save  the  list  of  pattern  matches.  */ 
match.list  =  f act.ptr->list ; 

if  (fact.ptr  ==  last.fact) 

{  last.fact  =  f  act_ptr->previous ;  > 

if  (f act_ptr->previous  ==  NULL) 

{ 
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/*  Delete  the  head  of  the  fact  list.  */ 
factlist  =  f actlist->next ; 
if  (factlist  ! =  NULL) 

{  f actlist->previous  =  NULL;  } 

> 

else 

{ 

/*  Delete  a  fact  other  than  the  head  of  the  fact  list.  */ 
f act_ptr->previous->next  =  f act_ptr->next ; 
if  (f act_ptr->next  !=  NULL) 

{  f act_ptr->next->previous  =  f act_ptr->previous ;  } 

> 

temp_ptr  =  garbage.f acts ; 
garbage_facts  =  fact.ptr; 
f act_ptr->next  =  temp.ptr; 

/*  Loop  through  the  list  of  all  the  patterns  that  */ 

/+  matched  the  fact.  */ 

match. retract (mat ch.l ist .fact _num) ; 

/*  Remove  all  activations  that  contain  this  fact  */ 

/*  from  the  agenda.  */ 

purge.agenda(fact.num) ; 
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return(l) ; 


> 


/«*********•*************«********************************************/ 

/*  ADD.FACT:  Places  a  fact  onto  the  end  of  the  fact  list  and  calls  */ 

/*  compare  to  filter  the  fact  through  the  pattern  network.  Returns  */ 

/*  null  if  the  fact  was  already  in  the  knowledge  base,  and  a  */ 

/*  pointer  to  the  fact  if  it  was  not  in  the  knowledge  base.  */ 

/**************************»******************************************/ 
struct  fact  *add_f act (new_f act) 
struct  fact  *new_fact; 

{ 

mt  hash_value; 
struct  fact  *temp_fact; 
char  print_space[20] ; 


/«========================================================*/ 

/*  If  fact  assertions  are  being  checked  for  duplications,  */ 
/*  then  search  the  fact  list  for  a  duplicate  fact.  */ 

/*  =  =  =  =  =  =  =  ===  =  =  =  =  =  =  =  =  =  =  ==  =  =  =  =  =  =  =  ==..====  ===  =  =  ===  =  =  =  =  =  ==  =  =  =  =  =:*/ 


hash.value  =  hash_f act (new. fact) ; 

if  (f act_exists(naw_f act .hash.value)  ==  1) 

{ 

rtn_«l(new_fact) ; 
return  (MULT.)  ; 

> 
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if  (RUNNING  ==  TRUE) 


{ 

temp. fact  =  new.fact; 

build.str (temp.f act) ; 

my.RHS.buf .type  =  ASSERT; 

my.RHS.buf . index  *  -1;  /*  not  used  */ 

isend (RHS.type ,  ftmy.RHS.buf ,  RHS.SIZE,  ALL.NODES ,  my.pid) 

> 

add.hash.f act (new.f act .hash.value) ; 

fact  to  the  fact  list.  Set  the  ID  for  the  */ 

/*  fact  and  install  the  symbols  used  by  the  fact  in  */ 

/*  the  symbol  table.  */ 

new.f act->next  =  NULL; 
new.fact  >ii0t  =  NULL; 
new.f act->previous  =  last.fact; 
if  (last.fact  ==  NULL) 

{  factlist  =  new.fact;  > 
else 

{  last.f act->next  =  new.fact;  } 
last.fact  =  new.fact; 


ID**; 

new_fact->ID  *  ID; 
fact.install (new.f act) ; 
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/*=«»■===============:====================--=-====*/ 

/*  Indicate  the  addition  of  the  fact  to  the  fact  */ 
/*  list  if  facts  are  being  watched.  */ 
/+===============================================*/ 


if  (vatch_facts  ==  ON) 

{ 

cl. print ("wtrace" , "==> 
show.f act ("wt race" .new.fact); 
cl_print( "wtrace" , "\n") ; 

I 

if  (get_crsv_trace_watch()  ==  ON) 

cl_print("wcrsv_tr" ,"AS  "); 

sprintf  (print.space ,  "*/,-51d  ("  ,new_f  act->ID) ; 

cl_print("wcrsv_tr" .print.space) ; 

show_elements("wcrsv_tr" .new.fact) ; 

cl. print ("wcrsv.tr"  ,")")  ; 

cl .print ("wcrsv.tr" , "\n" ) ; 

> 

change.facts  =  TRUE; 


/*==============================================*/ 

/*  Filter  the  fact  through  the  pattern  network.  */ 


compare (new.fact ,new_f act->atoms .network.pointer () , 1 ,0 .NULL, NULL) 
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retum(new_f act)  ; 


> 


/***************************************************/ 

/*  Build.Str:  */ 

/I**************************************************/ 

char  *build_str (f act.ptr) 
struct  fact  *fact_ptr; 

{ 

struct  element  *sublist,  *elem_ptr; 
char  *num_to_string() ; 
int  length,  i; 

length  =  f act_ptr->f act.length; 
sublist  =  f act_ptr->atoms ; 

strncpy (my_RHS_buf . f act_str , "" ,strlen(my_RHS.buf . f act.str) ) ; 

for  (i  =  0;  i  <  length;  i++) 

{ 

elem.ptr  =  tsublist[i]; 

if  (elem_ptr->type  ==  NUMBER) 

{ 

strcat (my.RHS.buf .fact.str, 

num_to_string(elem_ptr->val .fvalue)) ; 

> 

else  if  (elem_ptr->type  ==  WORD) 
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{ 

strcat (my _RHS_buf.fact.str, 

s ymbcl. string (el  em.pt r->val . hvalue) ) 

} 

else  if  (elem_ptr->type  ==  STRING) 

{ 

strcat (my.RHS.buf .f act.str ,  "\  '")  ; 
st r cat (my .RHS.buf . f act.str , 

symbol .string (el  em.pt r->val .hvalue) ) 
strcat (my .RHS.buf . fact _str ,  ; 

> 

if  (i  ♦  1  !=  length) 

{  strcat(mv. RHS.buf .fact.str,  "  ");  } 

} 

} 
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at  five  node  prot  essors.  1  he  node  program  oti  each  node  processor  ini t  iaii/fs  1! y per( '  1  1 1  ’ S 
for  tint'  professor,  load-  fact  s  an<l  templates  fr<>tn  fiost  files  const  it  ut  i rig  the  initial  wot  k:  ng 
it  minor  v  h >r  that  process ,r  ( same  on  all  i.  and  loads  t he  node  processor’s  rub's  from  spei  i fed 
host  files  -'each  rule  partition  should  fie  unique).  After  then  resetting  H v  per*  1.11'S  and 
;n  1 1  vat  i  ng  ijt--  i  red  runtime  watch  options,  the  nodes  start  off  synchronously  execu  t  ing  t  he 
match  step  m  the  mat i  h  select  act  cycle.  The  identical  working,  memory  is  maintained  on 
all  node  processor's  throughout  execution-. 

/  ‘i  HyperCI.ll’S  Hu.  sic  (  \irb  uf  1,  ft  rut  wit 

The  basic  production  system  algorithm  is  executed  by  H yper< '  I.I  PS  as  a  Match 


*se|e<  r  -Act  cycle  in  the  following  order: 


I .  Mat  ■  ! i  1  ,n  ii  j)  ri  h  i  .~s,  >r  e  v,i 1  u  .1 t  •  -  1  { 1  •  ■  I  I  { S >  1  >1  !  h •  ■  1  <  1  <  .1 !  pr<  m  !  u  i  t  n  a.  r  u  '•  ■  -  t  <  >  •  i 
which  are  '  i  - 1 1 « ■«  i  given  t  h  •  ■  1 1  j  r  r  *  ■  1 1 1  <  <»tii<,ni  -  of  it  -  !m  a  I  *  <  >  j  >  ■.  of  w .  .r  k  1  ng 

Satisfied  production  rub's  arc  riitcro!  into  the  local  conflict  -et  t . .  r  t!..it  j 
I  ho  start,  rule  1-  matched  f  1  r > t  liicaioc  it  has  an  empty  IMS 

S' 'i cci  1  ach  processor  chooses  t !;.  pile  with  the  highest  ode-me  from 
c  o  1 1 1 :  i  ■  t  set.  li  none  of  a  processor’s  production  rules  have  sat  i- fn-d  [.US-.  , 
less  than  the  r:iiu tin  11  t:i  ;>< >"i )>!»■  Ilypert  'I. IPS  salience  i-  u-.  ,i  .,-g  md; 

I  he  processors  perform  a  grav -code  mnipare/exchanue  of  •  a>  |.  pro,. —  ,r 


diei.ee  and  t i.at  pr 


■J  1  h\  per.  U  he.  the  Too 


is  guaranteed  to  hold  the  I, idlest  salnmie  of  ativ  candidat*  r..d  at  d  '  :  e 
process,  .r  t  ha'  hosts  that  rule  after  d  compare, 'exchange.-  lie  r-  •• ,!  f  . 

processors  'he  selected  rule  s  salience  and  host  ing  processor.'  lip  s,  t  •  ■ , 
ho-; ;  r  1  o  proi  I'.ssor  s  state  to  t  hat  of  at  t  ion  master  atui  -et : , g  • ; .  .  ; 
pr.  ..  ess, ,rs'  to  t  i.at  of  act  ion  s[„\e 

■  i  .V •  I  l.e  action  master  processor  ex«>(  utes  a  normal .  ~<-r.a’  (  I  II**'  p 
the  si'ji'f  t e.j  rule’s  HHS.  resultin'.’  in  Mime  combination  of  fa.  '  .  a.- 


and  ret  ract  ions  f  t.v  fact  IDc  1  hese  assertions  and  r*tr; 


a c 1  -  at 


the  master  processor  to  all  slave  processor-,  each  of  win. !,  * !,.  :.  a- 
desired  fact  into/out  of  its  local  working  metnorv. 

1.  If  a  termination  condition  is  detected  I>\  all  processor-.  1  • 

the  user.  Possible  termination  conditions,  include  t  lie  hide --t  r ne¬ 


on  any  processor  bring  the  null  salience  fj.e.  no  rules  are  -at 


I'tie.l  o’  ,t  ■ 


or  t lie  run  limit  having  been  reached.  If  no  termination  londi 
Step  1  f  M at  cli  j 


I  i  l)<  taih  (i  l)i>uju 

As  mentioned  above.  Ilypert  I. ITS  i-  developed  using  serial  <1  IT'  a-  an  emi,. 
program.  The  host  and  node  programs  replace  the  original  main  pr*igrain  defim-d  f..r 
Cl. IPS.  To  maintain  as  much  of  the  serial  Cl. IPS  inodularitv.  fntn  tion.ihu.  and  mteg* 


possible,  HyperCLIPS  is  accomplished  using  minimal  adaptation  of  original  CUPS  source 
code.  Specifically,  only  the  run()  routine  (in  CLIPS  file  “engine. c")  and  the  add.fnct()  and 
retract. fact  ()  routines  (in  CLIPS  file  “factmngr.c" )  are  altered  from  their  original  serial 
representation.  Some  extra  service  routines  are  added  to  these  same  two  files  to  implement 
HyperCLIPS  (see-  the  HyperCI.lPS  source  code  in  these  two  files  for  details). 

5  f'mbrdd ir i g  Ifyjx  rCL IPS 

Like  serial  CLIPS,  the  HyperCLIPS  shell  may  be  used  as  an  embedded  program. 
The  basic  approach  to  embedding  HyperCLIPS  for  use  by  another  application  program 
on  the  iPSC/2  is  to  ])  edit  the  host  and  node  programs  as  needed  to  perform  the  higher- 
level  application  and  2)  call  HyperCLIPS  using  the  function  calls  normally  used  to  control 
embedded  CLIPS. 

A  certain  minimum  set  of  function  calls  must  be  made  when  executing  HyperCLIPS 
as  an  embedded  program.  First,  init .clips ()  must  be  called  prior  to  any  other  CLIPS 
function  to  initialize  the  HyperCLIPS  environment.  Second,  all  deduct .  deftemplate,  and 
defrule  statements  (in  order)  must  be  loaded  as  the  initial  application  knowledge  base  using 
the  load.rulc  s( )  function.  Third,  rc.  act  .clips  ()  must  be  called  to  reset  the  HyperCLIPS 
environment,  thus  removing  all  activations  from  the  conflict  set  and  all  facts  from  working 
memory,  and  asserting  all  facts  listed  in  deffacts  statments  ;nto  working  memory.  Finally, 
the  top-level  program  must  call  run()  to  initiate  expert  system  execution  (allowing  rules 
to  fire).  All  other  CLIPS  functions  available  when  embedding  CLIPS  are  available  when 
embedding  HyperCLIPS. 
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Appendix  F.  Hyp'  rCl.IPS  1st  r.s  ’  Manual 
F.l  HyperCLl PS  Or<rri<ir 

HyperCLIPS  is  a  parallel  expert  system  shell  designed  to  execute  on  the  Intel  Sec¬ 
ond  Generation  Personal  Supercomputer  (iPSC/2).  HyperCLIPS  is  an  adaptation  of  the 
serial  ('  Language  integrated  Production  System  (CLIPS  Version  1.3)  developed  by  the 
.Artificial  Intelligence  Section  (A1S)  at  NASA/Johnson  Sparc  Center  (JSC  ).  HypejCI.IPS 
is  designed  as  a  research  tool  for  determining  the  processing  speedup  attainable  through 
parallel  processing  of  a  given  expert  system  application. 

Much  of  the  original  source  code  for  t  he  serial  version  of  CLIPS  is  used  in  the  Hyper - 
CLIPS  implementation.  Therefore,  those  interested  in  using  HyperCLIPS  must  first  ac¬ 
quire  CLIPS.  CLIPS  Version  1.3  is  available  through  the  Computer  Software  Management 
and  Information  Center  (COSMIC),  which  is  the  distribution  point  for  NASA  software. 
Further  information  ran  be  obtained  from 

COSMIC 

382  E .  Broad  St . 

Athens ,  GA  30602 
(404)  542-3265 

The  additional  source  code  and  setup  files  for  HyperCLIPS  are  available  upon  request 
to 

HyperCLIPS 

c/o  Dr.  Gary  Lamont 

Department  of  Computer  and  Electrical  Engineering 
Air  Force  Institute  of  Technology 
Wr ight-Patterson  AFB ,  OH  45433 

F.2  Requirrmrufs  for  Running  lhjjx  rf’LIPS 

To  prepare  the  HyperCLIPS  program  for  execution,  you  must  first  compile  it  using 
the  Greenhilis  C  (or  compatible)  Compiler  arid  link  it  with  the  iPSC/2's  run-time  system 


and  libraries.  The  iPSC/2  currently  uses  the  ('.MX  System  V/386  operating  system  on 
the  front-end  host  processor  and  Node  eXecutive  (NX/2)  for  each  of  the  node  processors 
in  the  hypercube.  Tsing  the  provided  setup  files  mentioned  above  to  compile  and  link, 
the  executable  file  produced  for  the  HyperCLIPS  shell  is  labelled  host.  Typing  host  at  the 
UNIX  system  prompt  will  begin  execution  of  HyperCLIPS. 

Your  own  CLIPS-syntax  expert  system  program  must  satisfy  a  few  requirements 
before  it  can  be  compiled  and  executed  by  HyperCLIPS.  vll  fact  and  template  declara¬ 
tions  and  all  productions  must  be  contained  in  one  or  more  files,  and  no  declarations  and 
productions  can  exist  in  the  same  file.  All  declarations  files  must  be  input  before  any 
productions  files  are  input.  To  initialize  the  working  memory  for  your  execution,  one  of 
your  productions  must  be  a  start  production  which  has  no  conditions  in  its  LHS  and  which 
has  your  desired  list  of  initializing  assertions  in  its  RMS.  The  following  is  a  sample  start 
production: 

1  d(  fruit  start  : 

(assert  (SWITCH  (STATUS  ON))) 

(assert  (SORTIE  (SET  MCCHORD))) 

( assert  (TAKEOFF  (PLAN  ROGER)))) 

Although  this  sample  start  production  happens  to  also  be  named  “start",  any  valid 
CLIPS  production  name  will  do.  When  execution  of  your  program  begins,  the  start  pro¬ 
duction  will  match  and  fire  before  any  other  productions.  If  you  desire  not  initialize  your 
working  memory  with  the  start  production,  simply  leave  ihe  start  production’s  RHS  empty. 

F.3  Interface  to  HyperCLIPS 

HyperCLIPS  does  not  maintain  the  high-level  user  interface  provided  by  CLIPS 
Version  4.3.  Instead,  HyperCLIPS  leads  the  user  through  a  series  of  prompts  to  allow 
input  of  some  key  expert  system  shell  commands  required  for  execution  to  proceed. 

After  the  startup  welcome  message,  the  first  HyperCLIPS  prompt  requests  the  de¬ 
sired  dimension  of  the  hypercube: 
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Enter  cube  dimension  (1-5): 


This  prompt  allows  the  user  to  specify  the  number  of  processor  nodes  to  be  applied  to 
solution  of  his  expert  system  application.  Note,  however,  that  the  response  to  this  prompt 
specifies  the  dmu  nsion  of  the  cube,  not  the  number  of  nodes  in  the  cube.  For  example,  a 
response  of  "S'  causes  the  activation  of  23,  or  eight,  processors  in  the  cube. 

Next.  HyperCLIPS  asks  for  the  list  of  fde  names  containing  the  fact  and  template 
declarations  files.  The  prompt  is  of  the  following  form: 


Enter  fact  file  names  for  all  processors: 
-> 


The  user  can  enter  any  number  of  file  names  for  loading,  terminating  entry  of  declarations 
files  with  an  "x“  (or  any  invalid  file  name).  The  facts  and  templates  specified  in  these  files 
are  loaded  onto  each  of  the  active  processors. 

Loading  of  the  productions  files  follows  a  process  similar  to  that  of  loading  decla¬ 
rations  files,  except  that  HyperCLIPS  prompts  for  the  files  to  be  loath'd  to  each  active 
processor  individually,  as  follows: 


Enter  rule  file  names  for  processor  P  : 
-> 


During  actual  execution  of  HyperCLIPS,  the  “P”  shown  in  this  sample  prompt  is 
replaced  with  the  ID  of  the  processor  currently  being  loaded  with  productions  files.  Hy¬ 
perCLIPS  does  not  protect  against  the  user  entering  the  same  productions  file  into  more 
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than  one  processor,  nor  does  it  check  that  no  single  production  is  loaded  into  more  than 
one  processor.  Parsing  of  production  rules  into  production  files,  and  the  subsequent  as¬ 
signment  of  production  files  to  specific  processors,  is  left  to  the  complete  discretion  of  the 
user.  Any  number  of  file  names  may  be  loaded,  entry  of  files  again  being  terminated  with 
an  “x"  (or  any  invalid  file  name). 

The  next  prompt  requests  the  desired  run  limit  for  execution.  The  user  can  specify 
the  number  of  rules  to  fire  before  suspending  execution,  or  the  user  may  enter  -1  to  set 
HvperCLlPS  to  run  until  no  more  rules  remain  to  fire  or  until  the  program  terminates 
(whichever  occurs  first): 


Enter  run  limit  (-1  for  no  limit): 


The  final  two  prompts  allow  setting  of  the  CLIPS  watch  options  for  debugging  and 
program  verification  purposes.  The  prompts  appear  in  the  following  form: 


Enter  watch  item  (lower  case) : 


Enter  watch_action_value  (1-ON,  O-OFF): 


The  user  first  enters  a  watch  item,  which  specifies  that  “rules”,  “facts”,  “activations", 
“compilations”,  or  “all"  be  displayed  during  expert  system  shell  execution.  The  watch 
action  value  prompt  turns  the  desired  watch  option  ON  (1)  or  OFF  (0). 

The  HvperCLlPS  program  terminates  after  a  single  execution,  requiring  that  re¬ 
sponses  to  the  entire  sequence  of  prompts  described  above  be  reaccomplished  for  subse¬ 
quent  runs.  This  limitation  is  very  inconvenient  when  a  large  number  of  declarations  files 
and  productions  files  must  be  entered  for  each  execution.  The  workaround  to  avoid  re¬ 
typing  of  prompt  responses  is  to  use  an  executable  macro  file  to  initiate  HvperCLlPS  and 
provide  responses  to  the  user  prompts. 
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F-4  HyperCLIPS  Limitations 

HyperCLIPS  is  designed  as  a  parallel  processing  research  tool.  As  such,  its  simple 
interface  environment  is  not  adequate  for  expert  system  application  program  debugging. 
The  user  is  encouraged  to  debug  his  expert  system  application  using  serial  CLIPS  Version 
4.3  before  attempting  to  process  the  application  using  HyperCLIPS. 

To  achieve  processing  speedups  using  HyperCLIPS,  two  conditions  must  exist: 

1.  The  user's  application  must  lend  itself  to  fortuitous  assignment  of  production  rules 
to  available  processors  so  as  to  attain  some  benefit  from  production  parallelism. 

2.  The  user  must  employ  some  algorithmic  mechanism  to  recognize  dependencies  and 
relationships  among  the  production  rules,  upon  which  fortuitous  assignment  to  avail¬ 
able  processors  depends. 

In  regard  to  condition  1,  HyperCLIPS  can  only  take  advantage  of  the  production  paral¬ 
lelism  inherent  in  the  productions  parsed  to  its  separate  processors.  Concerning  condition 
2.  HyperCLIPS  offers  no  algorithmic  inter-production  dependency  recognition  mechanism, 
as  the  burden  of  assignment  of  productions  to  processors  is  left  to  the  user. 
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